1 Introduction

The Split-\(\widehat{R}\) statistic and the effective sample size \(S_{\rm eff}\) (previously called \(N_{\rm eff}\) or \(n_{\rm eff}\)) are routinely used to monitor the convergence of iterative simulations, which are omnipresent in Bayesian statistics in the form of Markov-Chain Monto-Carlo samples. The original \(\widehat{R}\) statistic (Gelman and Rubin, 1992; Brooks and Gelman, 1998) and split-\(\widehat{R}\) (Gelman et al., 2013) are both based on the ratio of between and within-chain marginal variances of the simulations, while the latter is computed from split chains (hence the name).

\(\widehat{R}\), split-\(\widehat{R}\), and \(S_{\rm eff}\) are well defined only if the marginal distributions have finite mean and variance. Even if that’s the case, their estimates are less stable for distributions with long tails. To alleviate these problems, we define split-\(\widehat{R}\) and \(S_{\rm eff}\) using rank normalized values, empirical cumulative density functions, and small posterior intervals.

The code for the new split-\(\widehat{R}\) and \(S_{\rm eff}\) versions and for the corresponding diagnostic plots can be found in monitornew.R and monitorplot.R, respectively. An extended case study with more details and experiments is available at rhat_reff_extra.

2 Review of split-\(\widehat{R}\) and effective sample size

In this section, we will review the split-\(\widehat{R}\) and effective sample size estimates as implemented in Stan 2.18 (Stan Development Team, 2018e). These implementations represent the current de facto standard of convergence diagnostics for iterative simulations.

2.1 Split-\(\widehat{R}\)

Below, we present the computation of Split-\(\widehat{R}\) following Gelman et al. (2013), but using the notation style of Stan Development Team (2018a). Our general recommendation is to always run several chains. \(N\) is the number of draws per chain, \(M\) is the number of chains, and \(S=MN\) is the total number of draws from all chains. For each scalar summary of interest \(\theta\), we compute \(B\) and \(W\), the between- and within-chain variances:

\[ B = \frac{N}{M-1}\sum_{m=1}^{M}(\overline{\theta}^{(.m)}-\overline{\theta}^{(..)})^2, \;\mbox{ where }\;\;\overline{\theta}^{(.m)}=\frac{1}{N}\sum_{n=1}^N \theta^{(nm)},\;\; \;\;\overline{\theta}^{(..)} = \frac{1}{M}\sum_{m=1}^M\overline{\theta}^{(.m)} \\ W = \frac{1}{M}\sum_{m=1}^{M}s_j^2, \;\mbox{ where }\;\; s_m^2=\frac{1}{N-1} \sum_{n=1}^N (\theta^{(nm)}-\overline{\theta}^{(.m)})^2. \]

The between-chain variance, \(B\), also contains the factor \(N\) because it is based on the variance of the within-chain means, \(\overline{\theta}^{(.m)}\), each of which is an average of \(N\) values \(\theta^{(nm)}\).

We can estimate \(\mbox{var}(\theta \mid y)\), the marginal posterior variance of the estimand, by a weighted average of \(W\) and \(B\), namely \[ \widehat{\mbox{var}}^+(\theta \mid y)=\frac{N-1}{N}W + \frac{1}{N}B. \] This quantity overestimates the marginal posterior variance assuming the starting distribution of the simulations is appropriately overdispersed compared to the target distribution, but is unbiased under stationarity (that is, if the starting distribution equals the target distribution), or in the limit \(N\rightarrow\infty\). To have an overdispersed starting distribution, independent Markov chains should be initialized with diffuse starting values for the parameters.

Meanwhile, for any finite \(N\), the within-chain variance \(W\) should underestimate \(\mbox{var}(\theta \mid y)\) because the individual chains haven’t had the time to explore all of the target distribution and, as a result, will have less variability. In the limit as \(N\rightarrow\infty\), the expectation of \(W\) also approaches \(\mbox{var}(\theta \mid y)\).

We monitor convergence of the iterative simulations to the target distribution by estimating the factor by which the scale of the current distribution for \(\theta\) might be reduced if the simulations were continued in the limit \(N\rightarrow\infty\). This potential scale reduction is estimated as \[ \widehat{R}= \sqrt{\frac{\widehat{\mbox{var}}^+(\theta \mid y)}{W}}, \] which declines to 1 as \(N\rightarrow\infty\). We call this split-\(\widehat{R}\) because we are applying it to chains that have been split in half so that \(M\) is twice the number of actual chains. Without splitting, \(\widehat{R}\) would get fooled by non-stationary chains.

We note that split-\(\widehat{R}\) is also well defined for sequences that are not Markov-chains. However, for simplicity, we always refer to ‘chains’ instead of more generally to ‘sequences’ as the former is our primary use case for \(\widehat{R}\)-like measures.

2.2 Effective sample size \(S_{\rm eff}\)

If the \(N\) simulation draws within each chain were truly independent, the between-chain variance \(B\) would be an unbiased estimate of the posterior variance, \(\mbox{var}(\theta \mid y)\), and we would have a total of \(S = M \times N\) independent simulations from the \(M\) chains. In general, however, the simulations of \(\theta\) within each chain will be autocorrelated, and thus \(B\) will be larger than \(\mbox{var}(\theta \mid y)\), in expectation.

A nice introductory reference for analyzing MCMC results in general and effective sample size in particular is provided by Geyer (2011, see also 1992). The particular calculations used by Stan (Stan Development Team, 2018e) follow those for split-\(\widehat{R}\), which involve both between-chain (mean) and within-chain calculations (autocorrelation). They were introduced in the Stan manual (Stan Development Team, 2018d) and explained in more detail in Gelman et al. (2013).

One way to define effective sample size for correlated simulation draws is to consider the statistical efficiency of the average of the simulations \(\bar{\theta}^{(..)}\) as an estimate of the posterior mean \(\mbox{E}(\theta \mid y)\). This can be a reasonable baseline even though it is not the only possible summary and might be inappropriate, for example, if there is particular interest in accurate representation of low-probability events in the tails of the distribution.

The effective sample size of a chain is defined in terms of the autocorrelations within the chain at different lags. The autocorrelation \(\rho_t\) at lag \(t \geq 0\) for a chain with joint probability function \(p(\theta)\) with mean \(\mu\) and variance \(\sigma^2\) is defined to be \[ \rho_t = \frac{1}{\sigma^2} \, \int_{\Theta} (\theta^{(n)} - \mu) (\theta^{(n+t)} - \mu) \, p(\theta) \, d\theta. \] This is just the correlation between the two chains offset by \(t\) positions. Because we know \(\theta^{(n)}\) and \(\theta^{(n+t)}\) have the same marginal distribution in an MCMC setting, multiplying the two difference terms and reducing yields \[ \rho_t = \frac{1}{\sigma^2} \, \int_{\Theta} \theta^{(n)} \, \theta^{(n+t)} \, p(\theta) \, d\theta. \]

The effective sample size of one chain generated by a process with autocorrelations \(\rho_t\) is defined by \[ N_{\rm eff} \ = \ \frac{N}{\sum_{t = -\infty}^{\infty} \rho_t} \ = \ \frac{N}{1 + 2 \sum_{t = 1}^{\infty} \rho_t}. \]

Effective sample size \(N_{\rm eff}\) can be larger than \(N\) in case of antithetic Markov chains, which have negative autocorrelations on odd lags. The Dynamic Hamiltonian Monte-Carlo algorithms used in Stan (Hoffman and Gelman, 2014,@betancourt2017conceptual) can produce \(N_{\rm eff}>N\) for parameters with a close to Gaussian posterior (in the unconstrained space) and low dependency on other parameters.

2.2.1 Estimation of the Effective Sample Size

In practice, the probability function in question cannot be tractably integrated and thus neither autocorrelation nor the effective sample size can be calculated. Instead, these quantities must be estimated from the samples themselves. The rest of this section describes an autocorrelation and split-\(\widehat{R}\) based effective sample size estimator, based on multiple split chains. For simplicity, each chain will be assumed to be of the same length \(N\).

Stan carries out the autocorrelation computations for all lags simultaneously using Eigen’s fast Fourier transform (FFT) package with appropriate padding; see Geyer (2011) for more details on using FFT for autocorrelation calculations. The autocorrelation estimates \(\hat{\rho}_{t,m}\) at lag \(t\) from multiple chains \(m \in (1,\ldots,M)\) are combined with the within-chain variance estimate \(W\) and the multi-chain variance estimate \(\widehat{\mbox{var}}^{+}\) introduced in the previous section to compute the combined autocorrelation at lag \(t\) as \[ \hat{\rho}_t = 1 - \frac{\displaystyle W - \textstyle \frac{1}{M}\sum_{m=1}^M \hat{\rho}_{t,j}}{\widehat{\mbox{var}}^{+}}. \label{rhohat} \] If the chains have not converged, the variance estimator \(\widehat{\mbox{var}}^{+}\) will overestimate the true marginal variance which leads to an overestimation of the autocorrelation and an underestimation of the effective sample size.

Because of noise in the correlation estimates \(\hat{\rho}_t\) increases as \(t\) increases, typically the truncated sum of \(\hat{\rho}_t\) is used. Negative autocorrelations can happen only on odd lags and by summing over pairs starting from lag \(t=0\), the paired autocorrelation is guaranteed to be positive, monotone and convex modulo estimator noise (Geyer, 1992, 2011). Stan 2.18 uses Geyer’s initial monotone sequence criterion. The effective sample size of combined chains is defined as \[ S_{\rm eff} = \frac{N \, M}{\hat{\tau}}, \] where \[ \hat{\tau} = 1 + 2 \sum_{t=1}^{2k+1} \hat{\rho}_t = -1 + 2 \sum_{t'=0}^{k} \hat{P}_{t'}, \] and \(\hat{P}_{t'}=\hat{\rho}_{2t'}+\hat{\rho}_{2t'+1}\). The initial positive sequence estimator is obtained by choosing the largest \(k\) such that \(\hat{P}_{t'}>0\) for all \(t' = 1,\ldots,k\). The initial monotone sequence estimator is obtained by further reducing \(\hat{P}_{t'}\) to the minimum of the preceding values so that the estimated sequence becomes monotone.

The effective sample size \(S_{\rm eff}\) described here is different from similar formulas in the literature in that we use multiple chains and between-chain variance in the computation, which typically gives us more conservative claims (lower values of \(S_{\rm eff}\)) compared to single chain estimates, especially when mixing of the chains is poor. If the chains are not mixing at all (e.g., the posterior is multimodal and the chains are stuck in different modes), then our \(S_{\rm eff}\) is close to the number of chains.

Before version 2.18, Stan used a slightly incorrect initial sequence which implied that \(S_{\rm eff}\) was capped at \(S\) and thus the effective sample size was underestimated for some models. As antithetic behavior (i.e., \(S_{\rm eff} > S\)) is not that common or the effect is small, and capping at \(S\) can be considered to be pessimistic, this had negligible effect on any inference. However, it may have led to an underestimation of Stan’s efficiency when compared to other packages performing MCMC sampling.

3 Rank normalized split-\(\widehat{R}\) and relative efficiency estimates

As split-\(\widehat{R}\), and \(S_{\rm eff}\) are well defined only if the marginal posteriors have finite mean and variance, we next introduce variants which are well defined for all distributions and more robust for long tailed distributions.

3.1 Rank normalization

  1. Rank: Replace each value \(\theta^{(nm)}\) by its rank \(r^{(nm)}\). Average rank for ties are used to conserve the number of unique values of discrete quantities. Ranks are computed jointly for all draws from all chains.
  2. Normalize: Normalize ranks by inverse normal transformation \(z^{(nm)} = \phi^{-1}((r^{(nm)}-1/2)/S)\).

For continuous variables and \(S \rightarrow \infty\), the rank normalized values are normally distributed and rank normalization performs non-parametric transformation to normal distribution. Using normalized ranks instead of ranks directly, has the benefit that the behavior of \(\widehat{R}\) and \(S_{\rm eff}\) do not change for normally distributed \(\theta\).

3.2 Rank normalized split-\(\widehat{R}\)

Rank normalized split-\(\widehat{R}\) is computed using the equations in Section Split-\(\widehat{R}\) by replacing the original parameter values \(\theta^{(nm)}\) with their corresponding rank normalized values \(z^{(nm)}\).

3.3 Rank normalized folded-split-\(\widehat{R}\)

Both original and rank-normalized split-\(\widehat{R}\) can be fooled if chains have different scales but the same location. To alleviate this problem, we propose to compute rank normalized folded-split-\(\widehat{R}\) using folded split chains by rank normalizing absolute deviations from median \[ {\rm abs}(\theta^{(nm)}-{\rm median}(\theta)). \]

To obtain a single conservative \(\widehat{R}\) estimate, we propose to report the maximum of rank normalized split-\(\widehat{R}\) and rank normalized folded-split-\(\widehat{R}\) for each parameter.

3.4 Relative efficiency using rank normalized values

In addition to using rank-normalized values for convergence diagnostics via \(\widehat{R}\), we can also compute the corresponding effective sample size. This estimate will be well defined even if the original distribution does not have finite mean and variance. It is not directly applicable to estimate the Monte Carlo error of the mean of the original values, but it will provide a bijective transformation-invariant estimate of the mixing efficiency of chains. For simplicity we propose to report relative efficiency values \[ R_{\rm eff}=\textit{rank-normalized-split-}S_{\rm eff} / S, \] where \(\textit{split-}S_{\rm eff}\) is computed using equations in Section Effective sample size by replacing parameter values \(\theta^{(nm)}\) with rank normalized values \(z^{(nm)}\). For parameters with a close to normal distribution, the difference to using the original values is small. However, for parameters with a distribution far from normal, rank normalization can be seen as a near optimal non-parametric transformation.

The relative efficiency estimate using rank normalized values is a useful measure for relative efficiency of estimating the bulk (mean and quantiles near the median) of the distribution, and as shorthand term we use term bulk relative efficiency (bulk-\(R_{\rm eff}\)).

We propose to compute the relative efficiency also using folded split chains by rank normalizing absolute deviations from median (see above), which is a useful measure for the relative efficiency of estimating the distributions’ tail. As a shorthand, we use the term tail relative efficiency (tail-\(R_{\rm eff}\)).

3.5 Relative efficiency of the cumulative distribution function

The bulk and tail relative efficiency measures introduced above are useful as overall efficiency measures. Next, we introduce relative efficiency estimates of the cumulative distribution function (CDF), and later we use that to introduce relative efficiency diagnostics of quantiles and local small probability intervals.

Quantiles and posterior intervals derived on their basis are often reported quantities which are easy to estimate from posterior draws. Estimating the relative efficiency of such quantiles thus has a high practical relevance in particular as we observe the relative efficiency for tail quantiles to often be lower than for the mean or median. The \(\alpha\)-quantile is defined as the parameter value \(\theta_\alpha\) for which \(p(\theta \leq \theta_\alpha) = \alpha\). An estimate \(\hat{\theta}_\alpha\) of \(\theta_\alpha\) can thus be obtained by finding the \(\alpha\)-quantile of the empirical CDF (ECDF) of the posterior draws \(\theta^{(s)}\). However, quantiles cannot be written as an expectation, and thus the above equations for \(\widehat{R}\) and \(S_{\rm eff}\) are not directly applicable. Thus, we first focus on the relative efficiency of the cumulative probability \(p(\theta \leq \theta_\alpha)\) for different values of \(\theta_\alpha\).

For any \(\theta_\alpha\), the ECDF gives an estimate of the cumulative probability \[ p(\theta \leq \theta_\alpha) \approx \bar{I}_\alpha = \frac{1}{S}\sum_{s=1}^S I(\theta^{(s)} \leq\theta_\alpha), \] where \(I()\) is the indicator function. The indicator function transforms simulation draws to 0’s and 1’s, and thus the subsequent computations are bijectively invariant. Efficiency estimates of the ECDF at any \(\theta_\alpha\) can now be obtained by applying rank-normalizing and subsequent computations directly on the indictor function’s results.

3.6 Relative efficiency of quantiles

Assuming that we know the CDF to be a certain continuous function \(F\) which is smooth near an \(\alpha\)-quantile of interest, we could use the delta method to compute a variance estimate for \(F^{-1}(\bar{I}_\alpha)\). Although we don’t usually know \(F\), the delta method approach reveals that the variance of \(\bar{I}_\alpha\) for some \(\theta_\alpha\) is scaled by the (usually unknown) density \(f(\theta_\alpha)\), but the relative efficiency depends only on the relative efficiency of \(\bar{I}_\alpha\). Thus, we can use the relative efficiency of the ECDF computed via the indicator function \(I(\theta^{(s)} \leq \theta_\alpha)\) also for the corresponding quantile estimates.

3.7 Relative efficiency of median and MAD

Since the marginal posterior distributions might not have finite mean and variance, by default RStan (Stan Development Team, 2018c) and RStanARM (Stan Development Team, 2018b) report median and median absolute deviation (MAD) instead of mean and standard error (SE). Median and MAD are well defined even when the marginal distribution does not have finite mean and variance. Since the median is just 50%-quantile, we can estimate its relative efficiency as for any other quantile.

We can also compute the relative efficiency for the median absolute deviation (MAD), by computing the relative efficiency for the median of absolute deviations from the median of all draws. The absolute deviations from the median of all draws are same as previously defined for folded samples \[ {\rm abs}(\theta^{(nm)}-{\rm median}(\theta)). \] We see that the relative efficiency of MAD is obtained by using the same approach as for the median (and other quantiles) but with the folded values also used in rank-normalized-folded-split-\(S_{\rm eff}\).

3.8 Monte Carlo error estimates for quantiles

Previously, Stan has reported Monte Carlo standard error estimates for the mean of a quantity. This is valid only if the corresponding distribution has finite mean and variance; and even if valid, it may be easier and more robust to focus on the median and other quantiles, instead.

Median, MAD and quantiles are well defined even when the distribution does not have finite mean and variance, and they are asymptotically normal for continuous distributions which are non-zero in the relevant neighborhood. As the delta method for computing the variance would require explicit knowledge of the normalized posterior density, we propose the following alternative approach:

  1. Compute quantiles of the \({\rm Beta}(R_{\rm eff} \bar{I}_\alpha+1, R_{\rm eff}(1-\bar{I}_\alpha)+1)\) distribution. Including \(R_{\rm eff}\) takes into account the relative efficiency of the posterior draws.
  2. Find indices in \(\{1,\ldots,S\}\) closest to the ranks of these quantiles. For example, for quantile \(Q\), find \(s = {\rm round(Q S)}\).
  3. Use the corresponding \(\theta^{(s)}\) from the list of sorted posterior draws as quantiles from the error distribution. These quantiles can be used to approximate the Monte Carlo standard error.

3.9 Relative efficiency of small interval probability estimates

We can get more local relative efficiency estimates by considering small probability intervals. We propose to compute the relative efficiencies for \[ \bar{I}_{\alpha,\delta} = p(\hat{Q}_\alpha < \theta \leq \hat{Q}_{\alpha+\delta}), \] where \(\hat{Q}_\alpha\) is an empirical \(\alpha\)-quantile, \(\delta=1/k\) is the length of the interval with some positive integer \(k\), and \(\alpha \in (0,\delta,\ldots,1-\delta)\) changes in steps of \(\delta\). Each interval has \(S/k\) draws, and the efficiency measures autocorrelation of an indicator function which is \(1\) when the values are inside the specific interval and \(0\) otherwise. This gives us a local efficiency measure which does not depend on the shape of the distribution.

3.10 Rank plots

In addition to using rank-normalized values to compute split-\(\widehat{R}\), we propose to use rank plots for each chain instead of trace plots. Rank plots are nothing else than histograms of the ranked posterior samples (ranked over all chains) plotted separately for each chain. If rank plots of all chains look similar, this indicates good mixing of the chains. As compared to trace plots, rank plots don’t tend to squeeze to a mess in case of long chains.

3.11 Proposed changes in Stan

The proposal is to switch in Stan

  • from split-\(\widehat{R}\) (sRhat) to the maximum of rank-normalized-split-\(\widehat{R}\) and rank-normalized-folded-split-\(\widehat{R}\): max(zsRhat, zfsRhat)
  • from classic effective sample size estimate (neff), which currently doesn’t use chain splitting, to rank-normalized-split-\(R_{\rm eff}\) (zsreff) and rank-normalized-folded-split-\(R_{\rm eff}\) (zfsreff)
  • if computing median and MAD_SD, report corresponding \(R_{\rm eff}\)’s (medsreff and madsreff)

Justifications for the change are:

  • Rank normalization makes zsRhat and zsreff well defined for all distributions, under bijective transformations invariant and more stable than sRhat and neff. Adding a folded versions zfsRhat and zfsreff helps to detect differences in scale.
  • Relative efficiencies of the median and MAD (medsreff and masreff) are well defined for all distributions, bijective transformation invariant, more stable than neff, and focus on the quantities we are actually reporting.

In summary outputs, we propose to eventually use Rhat to denote the new version. However, to make it more clear that rank-normalized efficiency is different than standard efficiency, we could use the term Reff and display relative efficiency. The relative efficiency is also easier to check for low values as we don’t need to compare it to the total number of draws.

3.12 Proposed additions to bayesplot

The proposal is to add to bayesplot package the following

  • Rank plot
  • Relative efficiency of quantiles plot
  • Relative efficiency of small probability interval plot

3.13 Warning thresholds

Based on the experiments presented in the appdendices, more strict convergence diagnostics and relative efficiency warning limits could be used. We propose the following warning thresholds, although additional experiments would be useful:

  • 1.01 or 1.02 for new Rhat max(zsRhat, zfsRhat)
  • 0.1 for new relative efficiency (zsreff, zfsreff, medsreff, madsreff)

Plots shown in the upcoming section have dashed lines based on these thresholds (in continuous plots, a dashed line at 1.005 is plotted instead of 1.01, as values larger than that are usually rounded in our summaries to 1.01).

4 Examples

In this section, we will go through some examples to demonstrate the usefulness of our proposed methods as well as the associated workflow in determining convergence. First, we load all the necessary R packages and additional functions.

library(tidyverse)
library(gridExtra)
library(latex2exp)
library(rstan)
options(mc.cores = parallel::detectCores())
rstan_options(auto_write = TRUE)
library(bayesplot)
theme_set(bayesplot::theme_default(base_family = "sans"))
source('monitornew.R')
source('monitorplot.R')

4.1 Cauchy: A distribution with infinite mean and variance

The classic split-\(\widehat{R}\) is based on calculating within and between chain variances. If the marginal distribution of a chain is such that the variance is not defined (i.e., infinite), the classic split-\(\widehat{R}\) is not well justified. In this section, we will use the Cauchy distribution as an example of such a distribution.

The following Cauchy models are from Michael Betancourt’s case study Fitting The Cauchy Distribution

4.1.1 Nominal parameterization of Cauchy

The nominal Cauchy model with direct parameterization is as follows:

writeLines(readLines("cauchy_nom.stan"))
parameters {
  vector[50] x;
}

model {
  x ~ cauchy(0, 1);
}

generated quantities {
  real I = fabs(x[1]) < 1 ? 1 : 0;
}

4.1.1.1 Default Stan options

Run the nominal model:

fit_nom <- stan(file = 'cauchy_nom.stan', seed = 7878, refresh = 0)
Warning: There were 1233 transitions after warmup that exceeded the maximum treedepth. Increase max_treedepth above 10. See
http://mc-stan.org/misc/warnings.html#maximum-treedepth-exceeded
Warning: Examine the pairs() plot to diagnose sampling problems

Treedepth exceedence and Bayesian Fraction of Missing Information are dynamic HMC specific diagnostics (Betancourt, 2017). We get warnings about a very large number of transitions after warmup that exceeded the maximum treedepth, which is likely due to very long tails of the Cauchy distribution. All chains have low estimated Bayesian fraction of missing information also indicating slow mixing.

mon <- monitor(fit_nom)
print(mon)
Inference for the input samples (4 chains: each with iter = 2000; warmup = 1000):

          Q5    Q50    Q95 SE_Q5 SE_Q50 SE_Q95 Rhat Bulk_Reff Tail_Reff
x[1]   -5.74  -0.01   6.31  1.05   0.03   1.49 1.03      0.30      0.06
x[2]   -5.83  -0.01   6.07  0.90   0.03   1.25 1.01      0.66      0.11
x[3]   -5.23   0.01   5.73  0.69   0.03   0.76 1.01      0.67      0.13
x[4]   -6.25  -0.02   6.90  1.03   0.03   1.19 1.01      0.91      0.10
x[5]   -9.66  -0.05   5.11  5.56   0.03   0.79 1.01      0.16      0.08
x[6]   -5.26   0.00   5.41  0.56   0.02   0.66 1.01      0.77      0.14
x[7]   -6.35   0.06  10.77  0.98   0.03   4.44 1.02      0.15      0.07
x[8]   -6.45  -0.01   5.37  0.65   0.02   0.72 1.00      0.66      0.11
x[9]   -6.53   0.00   6.30  0.67   0.03   0.59 1.01      0.78      0.12
x[10]  -6.12  -0.01   5.92  0.62   0.02   0.83 1.01      0.61      0.11
x[11]  -6.73   0.00   6.15  1.04   0.02   1.09 1.01      0.52      0.10
x[12]  -5.68  -0.03   4.88  0.63   0.03   0.70 1.00      0.66      0.13
x[13]  -4.53  -0.06   4.27  0.38   0.02   0.53 1.00      0.79      0.16
x[14]  -4.88   0.00   5.03  0.90   0.02   0.99 1.00      0.37      0.14
x[15] -14.49  -0.01  11.51  7.43   0.05   3.52 1.03      0.12      0.05
x[16]  -7.03   0.01   6.96  1.54   0.03   0.99 1.01      0.58      0.08
x[17]  -6.59   0.01   7.69  1.19   0.02   2.03 1.01      0.57      0.10
x[18]  -4.51   0.05   6.92  0.49   0.03   1.53 1.01      0.66      0.08
x[19]  -7.66  -0.05   6.08  2.84   0.02   1.03 1.01      0.29      0.11
x[20]  -5.78   0.03  11.33  0.81   0.03   9.28 1.03      0.09      0.04
x[21]  -4.89   0.02   5.38  0.72   0.02   0.76 1.01      0.82      0.12
x[22]  -5.59   0.04   5.37  0.94   0.02   1.12 1.01      0.53      0.10
x[23] -14.45   0.01   7.00 29.24   0.03   2.10 1.01      0.10      0.06
x[24]  -5.93  -0.04   5.47  1.56   0.02   0.80 1.02      0.36      0.07
x[25]  -7.07  -0.02   5.94  1.87   0.03   0.72 1.01      0.39      0.09
x[26]  -8.96  -0.06   5.85  2.17   0.04   0.59 1.01      0.44      0.09
x[27]  -8.81  -0.01   8.34  2.63   0.02   2.05 1.00      0.45      0.07
x[28]  -5.26   0.02   5.93  1.11   0.03   0.99 1.02      0.94      0.11
x[29]  -5.88   0.00   5.90  0.78   0.03   0.81 1.01      0.91      0.11
x[30]  -5.57  -0.02   5.14  0.98   0.02   0.58 1.00      1.09      0.12
x[31]  -7.36   0.01   7.25  0.75   0.02   0.84 1.00      0.85      0.11
x[32]  -8.85  -0.04   6.50  4.24   0.02   0.69 1.01      0.14      0.07
x[33]  -4.79   0.03   4.96  0.70   0.02   0.45 1.01      0.66      0.14
x[34]  -5.91  -0.03   5.37  0.81   0.03   0.89 1.01      0.60      0.13
x[35]  -6.10   0.02   6.79  0.98   0.03   1.05 1.01      0.66      0.10
x[36]  -4.86   0.10  15.60  0.70   0.05  70.12 1.04      0.04      0.03
x[37]  -8.93   0.00   7.86  2.14   0.03   1.86 1.01      0.29      0.09
x[38]  -5.54  -0.02   4.83  1.16   0.02   0.66 1.00      0.84      0.12
x[39]  -5.89  -0.03   5.84  1.05   0.02   0.94 1.02      1.12      0.11
x[40]  -8.71   0.01   6.71  4.36   0.02   1.16 1.00      0.22      0.06
x[41]  -6.88  -0.03   7.55  1.44   0.03   1.13 1.00      0.43      0.09
x[42]  -6.15   0.01   6.48  1.09   0.02   1.45 1.01      0.38      0.09
x[43]  -6.38   0.00   7.39  1.32   0.03   1.46 1.01      0.69      0.08
x[44]  -7.84   0.00   7.99  1.06   0.03   1.69 1.01      0.75      0.12
x[45]  -4.77  -0.02   4.66  0.40   0.03   0.35 1.01      0.73      0.14
x[46]  -4.84   0.00   5.99  0.69   0.03   1.13 1.00      0.24      0.11
x[47]  -8.51   0.03  24.26  4.93   0.05  22.21 1.07      0.06      0.02
x[48]  -6.73   0.00   5.33  1.32   0.03   0.71 1.01      0.48      0.10
x[49]  -6.27   0.04   6.92  0.92   0.03   1.65 1.00      0.37      0.09
x[50]  -5.21   0.00   4.65  0.66   0.02   0.64 1.00      0.78      0.13
I       0.00   0.50   1.00  0.00   0.50     NA  NaN      0.10       NaN
lp__  -92.71 -69.21 -49.03  1.74   1.34   0.98 1.05      0.03      0.12

For each parameter, Bulk_Reff and Tail_Reff are crude measures of relative
effective sample size for bulk and tail quantities respectively (good mixing
Reff > 0.1), and Rhat is the potential scale reduction factor on rank normalized
split chains (at convergence, Rhat = 1).
which_min_eff <- which.min(mon[1:50, 'Bulk_Reff'])

Several Rhat > 1.01 and some Reff < 0.1 indicate that the results should not be trusted. The extended case study rhat_reff_extra has more results with longer chains as well.

We can further analyze potential problems using local relative efficiency and rank plots. We specifically investigate x[36], which has the smallest bulk relative efficiency 0.03.

We examine the relative efficiency in different parts of the posterior by computing the relative efficiency of small interval probability estimates (see Section Relative efficiency of small interval probability estimates). Each interval contains \(1/k\) of the draws (e.g., \(5\%\) each, if \(k=20\)). The small interval efficiency measures mixing of an function which indicates when the values are inside or outside the specific small interval. As detailed above, this gives us a local efficiency measure which does not depend on the shape of the distribution.

plot_local_reff(fit = fit_nom, par = which_min_eff, nalpha = 20)

We see that the efficiency of our posterior draws is worryingly low in the tails (which is caused by slow mixing in long tails of Cauchy). Orange ticks show iterations that exceeded the maximum treedepth.

An alternative way to examine the relative efficiency in different parts of the posterior is to compute relative efficiencies for quantiles (see Section Relative efficiency of quantiles). Each interval has a specified proportion of draws, and the efficiency measures mixing of a function which indicates when the values are smaller than or equal to the corresponding quantile.

plot_quantile_reff(fit = fit_nom, par = which_min_eff, nalpha = 40)

Similar as above, we see that the efficiency of our posterior draws is worryingly low in the tails. Again, orange ticks show iterations that exceeded the maximum treedepth.

We can further analyze potential problems using rank plots in which we clearly see differences between chains.

samp <- as.array(fit_nom)
xmin <- paste0("x[", which_min_eff, "]")
mcmc_hist_r_scale(samp[, , xmin])

4.1.2 Alternative parameterization of Cauchy

Next we examine an alternative parameterization that considers the Cauchy distribution as a scale mixture of Gaussian distributions. The model has two parameters and the Cauchy distributed \(x\)’s can be computed from those. In addition to improved sampling performance, the example illustrates that focusing on diagnostics matters.

writeLines(readLines("cauchy_alt_1.stan"))
parameters {
  vector[50] x_a;
  vector<lower=0>[50] x_b;
}

transformed parameters {
  vector[50] x = x_a ./ sqrt(x_b);
}

model {
  x_a ~ normal(0, 1);
  x_b ~ gamma(0.5, 0.5);
}

generated quantities {
  real I = fabs(x[1]) < 1 ? 1 : 0;
}

Run the alternative model:

fit_alt1 <- stan(file = 'cauchy_alt_1.stan', seed = 7878, refresh = 0)

There are no warnings, and the sampling is much faster.

mon <- monitor(fit_alt1)
print(mon)
Inference for the input samples (4 chains: each with iter = 2000; warmup = 1000):

            Q5    Q50    Q95 SE_Q5 SE_Q50 SE_Q95 Rhat Bulk_Reff Tail_Reff
x_a[1]   -1.65  -0.02   1.66  0.03   0.03   0.03 1.00      1.09      0.52
x_a[2]   -1.67   0.00   1.63  0.03   0.02   0.04 1.00      1.00      0.61
x_a[3]   -1.66  -0.03   1.62  0.03   0.02   0.05 1.00      1.05      0.61
x_a[4]   -1.63  -0.01   1.69  0.03   0.02   0.05 1.00      0.91      0.55
x_a[5]   -1.64  -0.01   1.58  0.04   0.02   0.02 1.00      0.86      0.62
x_a[6]   -1.62  -0.01   1.56  0.05   0.02   0.04 1.00      0.99      0.61
x_a[7]   -1.61   0.00   1.65  0.05   0.02   0.04 1.00      0.98      0.57
x_a[8]   -1.67  -0.03   1.60  0.03   0.02   0.04 1.00      0.93      0.62
x_a[9]   -1.63  -0.04   1.55  0.04   0.02   0.05 1.00      0.96      0.59
x_a[10]  -1.66  -0.03   1.74  0.04   0.02   0.04 1.00      0.86      0.56
x_a[11]  -1.56  -0.02   1.55  0.03   0.02   0.04 1.00      0.88      0.59
x_a[12]  -1.64   0.00   1.69  0.03   0.02   0.04 1.00      0.87      0.59
x_a[13]  -1.59   0.01   1.63  0.04   0.02   0.03 1.00      0.87      0.57
x_a[14]  -1.72  -0.03   1.63  0.03   0.02   0.04 1.00      0.97      0.54
x_a[15]  -1.65   0.02   1.70  0.05   0.02   0.04 1.00      0.96      0.58
x_a[16]  -1.67  -0.01   1.62  0.04   0.02   0.04 1.00      0.90      0.61
x_a[17]  -1.63  -0.01   1.66  0.04   0.02   0.03 1.00      1.09      0.58
x_a[18]  -1.65  -0.01   1.69  0.03   0.02   0.03 1.00      1.16      0.56
x_a[19]  -1.61  -0.02   1.59  0.03   0.02   0.03 1.00      1.01      0.58
x_a[20]  -1.64  -0.01   1.62  0.04   0.02   0.04 1.00      1.08      0.60
x_a[21]  -1.67   0.04   1.68  0.05   0.02   0.03 1.00      0.91      0.57
x_a[22]  -1.63  -0.01   1.64  0.04   0.02   0.05 1.00      0.99      0.51
x_a[23]  -1.62   0.03   1.70  0.04   0.02   0.05 1.00      1.07      0.56
x_a[24]  -1.66   0.00   1.67  0.04   0.02   0.03 1.00      0.93      0.53
x_a[25]  -1.64   0.02   1.66  0.05   0.02   0.04 1.00      1.10      0.55
x_a[26]  -1.65  -0.02   1.61  0.05   0.02   0.04 1.00      0.96      0.59
x_a[27]  -1.66   0.00   1.68  0.05   0.02   0.03 1.00      0.83      0.53
x_a[28]  -1.69   0.02   1.68  0.04   0.02   0.04 1.00      1.01      0.51
x_a[29]  -1.57   0.00   1.64  0.04   0.02   0.03 1.00      1.02      0.54
x_a[30]  -1.61   0.00   1.64  0.04   0.02   0.04 1.00      0.99      0.55
x_a[31]  -1.65   0.01   1.61  0.04   0.02   0.03 1.00      1.02      0.60
x_a[32]  -1.59  -0.01   1.59  0.03   0.02   0.04 1.00      1.05      0.65
x_a[33]  -1.69  -0.01   1.77  0.04   0.02   0.04 1.00      0.96      0.59
x_a[34]  -1.68   0.01   1.69  0.03   0.02   0.04 1.00      1.00      0.54
x_a[35]  -1.61   0.03   1.67  0.04   0.02   0.05 1.00      0.98      0.58
x_a[36]  -1.67   0.00   1.60  0.04   0.01   0.04 1.00      1.03      0.53
x_a[37]  -1.69  -0.03   1.59  0.04   0.02   0.04 1.00      0.99      0.59
x_a[38]  -1.67  -0.02   1.64  0.03   0.02   0.05 1.00      1.05      0.56
x_a[39]  -1.66  -0.02   1.65  0.05   0.02   0.04 1.00      1.12      0.49
x_a[40]  -1.63   0.00   1.62  0.03   0.02   0.03 1.00      1.03      0.55
x_a[41]  -1.68   0.02   1.61  0.05   0.02   0.05 1.00      0.99      0.55
x_a[42]  -1.64   0.03   1.64  0.04   0.02   0.04 1.00      1.04      0.52
x_a[43]  -1.64  -0.03   1.63  0.04   0.02   0.04 1.00      0.95      0.62
x_a[44]  -1.66  -0.03   1.63  0.03   0.02   0.04 1.00      0.94      0.58
x_a[45]  -1.68   0.02   1.73  0.05   0.02   0.04 1.00      0.95      0.55
x_a[46]  -1.56   0.04   1.66  0.05   0.02   0.04 1.00      1.09      0.64
x_a[47]  -1.66   0.02   1.58  0.04   0.02   0.04 1.00      1.07      0.61
x_a[48]  -1.61   0.00   1.66  0.03   0.02   0.05 1.00      1.00      0.55
x_a[49]  -1.62   0.00   1.64  0.04   0.02   0.03 1.00      0.98      0.58
x_a[50]  -1.62  -0.01   1.58  0.05   0.02   0.03 1.00      0.95      0.61
x_b[1]    0.00   0.44   4.00  0.00   0.02   0.15 1.00      0.57      1.16
x_b[2]    0.00   0.46   3.75  0.00   0.02   0.08 1.00      0.61      1.04
x_b[3]    0.00   0.47   3.89  0.00   0.02   0.18 1.00      0.89      1.08
x_b[4]    0.00   0.46   3.83  0.00   0.02   0.16 1.00      0.67      1.17
x_b[5]    0.00   0.45   3.95  0.00   0.02   0.09 1.00      0.76      0.97
x_b[6]    0.00   0.44   3.80  0.00   0.02   0.10 1.00      0.82      1.11
x_b[7]    0.01   0.43   3.62  0.00   0.02   0.14 1.00      0.72      1.09
x_b[8]    0.00   0.44   3.79  0.00   0.02   0.12 1.00      0.72      1.24
x_b[9]    0.00   0.50   3.67  0.00   0.02   0.15 1.00      0.71      1.01
x_b[10]   0.00   0.44   3.79  0.00   0.02   0.13 1.00      0.63      0.99
x_b[11]   0.01   0.49   3.90  0.00   0.02   0.10 1.00      0.90      1.03
x_b[12]   0.00   0.44   3.82  0.00   0.02   0.12 1.00      0.60      1.05
x_b[13]   0.00   0.46   3.97  0.00   0.03   0.11 1.00      0.51      1.13
x_b[14]   0.00   0.44   3.97  0.00   0.02   0.14 1.00      0.71      1.14
x_b[15]   0.01   0.45   3.77  0.00   0.02   0.07 1.00      0.71      1.18
x_b[16]   0.00   0.48   3.79  0.00   0.02   0.09 1.00      0.67      1.03
x_b[17]   0.01   0.46   3.93  0.00   0.02   0.09 1.00      0.69      1.13
x_b[18]   0.01   0.50   4.11  0.00   0.02   0.13 1.00      0.67      1.03
x_b[19]   0.00   0.45   3.92  0.00   0.02   0.14 1.00      0.60      1.12
x_b[20]   0.00   0.42   3.96  0.00   0.02   0.11 1.00      0.57      1.11
x_b[21]   0.01   0.49   3.94  0.00   0.02   0.13 1.00      0.77      0.99
x_b[22]   0.00   0.45   3.95  0.00   0.02   0.12 1.00      0.75      1.04
x_b[23]   0.00   0.46   3.95  0.00   0.02   0.11 1.00      0.45      1.17
x_b[24]   0.00   0.44   3.86  0.00   0.02   0.07 1.00      0.48      1.02
x_b[25]   0.00   0.45   3.66  0.00   0.02   0.14 1.00      0.59      0.96
x_b[26]   0.00   0.47   4.05  0.00   0.02   0.11 1.00      0.61      1.07
x_b[27]   0.00   0.45   3.90  0.00   0.02   0.13 1.00      0.69      1.10
x_b[28]   0.00   0.46   3.79  0.00   0.02   0.10 1.00      0.84      1.08
x_b[29]   0.01   0.46   3.87  0.00   0.01   0.16 1.00      0.86      1.16
x_b[30]   0.00   0.44   3.89  0.00   0.02   0.11 1.00      0.71      1.13
x_b[31]   0.00   0.49   3.84  0.00   0.02   0.12 1.00      0.76      1.05
x_b[32]   0.00   0.43   3.75  0.00   0.02   0.09 1.00      0.57      0.96
x_b[33]   0.00   0.45   3.79  0.00   0.02   0.15 1.00      0.77      1.13
x_b[34]   0.00   0.47   3.97  0.00   0.02   0.10 1.00      0.83      1.08
x_b[35]   0.00   0.48   3.84  0.00   0.02   0.18 1.00      0.62      0.97
x_b[36]   0.00   0.44   3.89  0.00   0.02   0.18 1.00      0.78      1.12
x_b[37]   0.00   0.46   3.70  0.00   0.02   0.13 1.00      0.66      1.15
x_b[38]   0.01   0.45   3.93  0.00   0.02   0.11 1.00      0.79      1.20
x_b[39]   0.00   0.45   3.75  0.00   0.02   0.15 1.00      0.51      1.01
x_b[40]   0.00   0.42   3.76  0.00   0.02   0.14 1.00      0.66      1.11
x_b[41]   0.00   0.46   3.78  0.00   0.02   0.11 1.00      0.66      1.06
x_b[42]   0.00   0.45   3.89  0.00   0.02   0.11 1.00      0.58      1.05
x_b[43]   0.00   0.47   4.03  0.00   0.02   0.13 1.00      0.74      0.96
x_b[44]   0.00   0.43   3.69  0.00   0.02   0.11 1.00      0.64      1.21
x_b[45]   0.00   0.44   3.66  0.00   0.02   0.09 1.00      0.68      1.11
x_b[46]   0.00   0.46   3.74  0.00   0.02   0.07 1.00      0.63      1.06
x_b[47]   0.01   0.47   3.83  0.00   0.02   0.13 1.00      0.99      1.05
x_b[48]   0.01   0.48   3.89  0.00   0.02   0.10 1.00      0.80      1.12
x_b[49]   0.00   0.47   3.74  0.00   0.02   0.09 1.00      0.64      1.00
x_b[50]   0.00   0.46   4.01  0.00   0.02   0.14 1.00      0.72      1.07
x[1]     -6.47  -0.02   6.52  0.66   0.03   0.63 1.01      0.98      0.44
x[2]     -6.52   0.00   6.50  0.37   0.03   0.77 1.00      0.94      0.54
x[3]     -6.31  -0.03   5.95  0.55   0.02   0.40 1.00      0.92      0.60
x[4]     -6.76  -0.01   5.86  0.52   0.02   0.51 1.00      0.81      0.52
x[5]     -6.67  -0.01   5.75  0.70   0.02   0.49 1.00      0.84      0.57
x[6]     -5.64  -0.01   6.48  0.38   0.03   0.57 1.00      0.95      0.61
x[7]     -6.59   0.00   6.37  0.70   0.03   0.51 1.00      0.87      0.60
x[8]     -6.50  -0.04   6.29  0.50   0.03   0.49 1.00      0.87      0.58
x[9]     -5.73  -0.04   5.96  0.60   0.02   0.44 1.00      0.87      0.55
x[10]    -6.37  -0.04   6.43  0.85   0.02   0.67 1.00      0.72      0.52
x[11]    -5.61  -0.02   5.55  0.37   0.02   0.38 1.00      0.86      0.60
x[12]    -7.12   0.00   6.19  0.72   0.02   0.53 1.00      0.84      0.55
x[13]    -6.44   0.02   6.22  0.67   0.03   0.52 1.00      0.83      0.41
x[14]    -6.72  -0.04   6.56  0.67   0.03   0.59 1.00      0.92      0.53
x[15]    -5.68   0.02   6.06  0.66   0.03   0.42 1.00      0.88      0.54
x[16]    -6.99  -0.01   7.24  0.77   0.03   0.47 1.00      0.90      0.54
x[17]    -5.78  -0.01   5.39  0.45   0.03   0.56 1.00      0.97      0.58
x[18]    -5.45  -0.01   5.92  0.57   0.02   0.58 1.00      1.08      0.47
x[19]    -7.16  -0.02   5.84  0.73   0.03   0.59 1.00      0.88      0.52
x[20]    -7.39  -0.01   6.68  0.76   0.02   0.80 1.00      0.85      0.49
x[21]    -5.93   0.05   5.85  0.75   0.03   0.48 1.00      0.85      0.61
x[22]    -6.96  -0.02   6.23  0.76   0.02   0.69 1.00      0.89      0.54
x[23]    -5.99   0.05   6.54  0.49   0.03   0.61 1.00      0.86      0.45
x[24]    -6.75   0.00   6.99  0.68   0.03   0.72 1.00      0.89      0.35
x[25]    -6.33   0.02   6.40  0.57   0.03   0.62 1.00      1.01      0.48
x[26]    -5.57  -0.02   6.32  0.34   0.03   0.53 1.00      0.83      0.64
x[27]    -6.40   0.00   6.17  0.41   0.02   0.55 1.00      0.84      0.52
x[28]    -5.70   0.02   6.49  0.47   0.03   0.63 1.00      0.86      0.53
x[29]    -5.40   0.00   5.67  0.39   0.03   0.37 1.00      0.98      0.61
x[30]    -5.83  -0.01   6.45  0.47   0.02   0.62 1.00      0.91      0.55
x[31]    -5.85   0.02   5.66  0.60   0.03   0.52 1.00      0.90      0.57
x[32]    -6.49  -0.01   6.18  0.55   0.03   0.55 1.00      1.06      0.51
x[33]    -7.08  -0.01   6.34  0.75   0.03   0.59 1.00      0.95      0.57
x[34]    -6.59   0.01   6.05  0.47   0.02   0.64 1.00      1.02      0.59
x[35]    -5.44   0.03   6.13  0.49   0.02   0.55 1.00      0.94      0.55
x[36]    -6.15   0.00   6.11  0.52   0.02   0.51 1.00      0.90      0.61
x[37]    -6.08  -0.03   5.34  0.64   0.02   0.42 1.00      0.91      0.55
x[38]    -5.74  -0.02   5.77  0.49   0.02   0.49 1.00      0.95      0.64
x[39]    -6.73  -0.03   5.84  0.66   0.03   0.60 1.00      1.03      0.40
x[40]    -6.39   0.01   6.52  0.85   0.03   0.37 1.00      0.90      0.52
x[41]    -6.01   0.02   5.92  0.49   0.03   0.48 1.00      0.87      0.52
x[42]    -7.39   0.05   6.86  0.78   0.03   0.88 1.00      0.89      0.45
x[43]    -5.98  -0.03   6.69  0.41   0.02   0.61 1.00      0.96      0.56
x[44]    -7.04  -0.04   6.17  0.60   0.03   0.63 1.00      0.83      0.56
x[45]    -5.75   0.02   6.23  0.39   0.03   0.58 1.00      0.94      0.55
x[46]    -5.59   0.06   6.33  0.37   0.03   0.93 1.00      0.97      0.47
x[47]    -5.49   0.03   5.43  0.30   0.02   0.45 1.00      0.97      0.69
x[48]    -5.96   0.00   4.85  0.42   0.03   0.38 1.00      0.92      0.58
x[49]    -6.55   0.00   5.25  0.69   0.03   0.43 1.00      0.89      0.52
x[50]    -6.74  -0.01   6.90  0.64   0.03   0.62 1.00      0.86      0.57
I         0.00   0.00   1.00  0.00   0.50     NA 1.00      0.66      0.66
lp__    -95.19 -80.99 -68.66  0.55   0.25   0.26 1.00      0.33      0.59

For each parameter, Bulk_Reff and Tail_Reff are crude measures of relative
effective sample size for bulk and tail quantities respectively (good mixing
Reff > 0.1), and Rhat is the potential scale reduction factor on rank normalized
split chains (at convergence, Rhat = 1).
which_min_eff <- which.min(mon[101:150, 'Bulk_Reff'])

All Rhat < 1.01 and Reff > 0.1 indicate the sampling worked much better with the alternative parameterization. The extended case study rhat_reff_extra has more results using alternative parameterizations.

We can further analyze potential problems using local relative efficiency and rank plots. We take a detailed look at x[10], which has the smallest bulk relative efficiency of 0.86.

We examine the relative efficiency in different parts of the posterior by computing the relative efficiency of small interval probability estimates.

plot_local_reff(fit = fit_alt1, par = which_min_eff + 100, nalpha = 20)

The relative efficiency is good in all parts of the posterior. Further, we examine the relative efficiency of different quantile estimates.

plot_quantile_reff(fit = fit_alt1, par = which_min_eff + 100, nalpha = 40)

Rank plots also look rather similar across chains.

samp <- as.array(fit_alt1)
xmin <- paste0("x[", which_min_eff, "]")
mcmc_hist_r_scale(samp[, , xmin])

In summary, the alterative parameterization produces results that look much better than for the nominal parameterization. There are still some differences in the tails, which we also identified via the tail relative efficiency estimates.

4.1.3 Half-Cauchy with nominal parameterization

Half-Cauchy priors are common and, for example, in Stan usually set using the nominal parameterization. However, when the constraint <lower=0> is used, Stan does the sampling automatically in the unconstrained log(x) space, which changes the geometry crucially.

writeLines(readLines("half_cauchy_nom.stan"))
parameters {
  vector<lower=0>[50] x;
}

model {
  x ~ cauchy(0, 1);
}

generated quantities {
  real I = fabs(x[1]) < 1 ? 1 : 0;
}

Run the half-Cauchy with nominal parameterization (and positive constraint):

fit_half_nom <- stan(file = 'half_cauchy_nom.stan', seed = 7878, refresh = 0)

There are no warnings, and the sampling is much faster than for the Cauchy nominal model.

mon <- monitor(fit_half_nom)
print(mon)
Inference for the input samples (4 chains: each with iter = 2000; warmup = 1000):

          Q5    Q50    Q95 SE_Q5 SE_Q50 SE_Q95 Rhat Bulk_Reff Tail_Reff
x[1]    0.08   1.03  13.56  0.01   0.02   1.63    1      2.02      0.43
x[2]    0.10   1.02  10.79  0.01   0.02   0.93    1      2.47      0.50
x[3]    0.06   1.00  12.86  0.01   0.02   1.22    1      1.97      0.43
x[4]    0.09   0.98  11.50  0.01   0.02   1.05    1      1.90      0.42
x[5]    0.08   1.03  14.01  0.01   0.02   1.49    1      1.87      0.42
x[6]    0.08   0.99  11.72  0.01   0.01   1.13    1      1.99      0.46
x[7]    0.07   0.98  12.02  0.01   0.02   1.32    1      2.08      0.41
x[8]    0.09   0.99  10.82  0.01   0.02   0.77    1      2.05      0.42
x[9]    0.09   1.02  11.56  0.01   0.02   0.87    1      2.12      0.41
x[10]   0.07   1.03  14.90  0.01   0.02   1.21    1      1.99      0.45
x[11]   0.08   0.97  12.54  0.01   0.02   1.36    1      1.75      0.34
x[12]   0.07   1.03  13.33  0.01   0.02   1.23    1      2.06      0.46
x[13]   0.08   1.01  14.11  0.01   0.02   1.37    1      2.02      0.43
x[14]   0.07   0.96  13.48  0.01   0.03   1.26    1      1.86      0.44
x[15]   0.07   1.00  14.43  0.01   0.02   1.33    1      1.69      0.41
x[16]   0.08   0.99  14.06  0.01   0.02   1.46    1      1.85      0.46
x[17]   0.08   0.99  11.70  0.01   0.02   1.40    1      1.79      0.42
x[18]   0.09   0.98  11.56  0.01   0.02   1.24    1      1.90      0.44
x[19]   0.07   0.99  14.31  0.01   0.02   1.42    1      1.96      0.38
x[20]   0.09   1.00  11.21  0.01   0.02   1.22    1      1.98      0.48
x[21]   0.07   1.00  12.01  0.01   0.03   0.88    1      1.96      0.41
x[22]   0.09   1.02  12.69  0.01   0.02   0.92    1      1.93      0.40
x[23]   0.07   0.98  13.46  0.01   0.02   1.32    1      1.78      0.35
x[24]   0.07   1.00  12.31  0.01   0.02   1.08    1      1.72      0.38
x[25]   0.09   1.00  11.53  0.01   0.02   0.62    1      1.94      0.48
x[26]   0.08   0.98  10.68  0.01   0.02   1.08    1      1.61      0.44
x[27]   0.08   1.01  13.37  0.01   0.02   1.06    1      1.75      0.42
x[28]   0.08   0.97  11.41  0.01   0.02   0.60    1      2.41      0.43
x[29]   0.07   1.00  13.68  0.01   0.03   1.21    1      1.53      0.46
x[30]   0.09   1.02  10.99  0.01   0.02   0.89    1      1.99      0.44
x[31]   0.08   0.96  12.42  0.01   0.02   1.02    1      1.62      0.47
x[32]   0.10   1.01  11.46  0.01   0.02   1.24    1      1.76      0.43
x[33]   0.08   0.99  13.76  0.01   0.01   1.47    1      1.73      0.44
x[34]   0.08   0.98  12.38  0.01   0.02   0.91    1      2.15      0.43
x[35]   0.07   0.96  13.57  0.01   0.02   1.74    1      1.60      0.36
x[36]   0.06   1.00  13.54  0.01   0.02   1.29    1      1.92      0.42
x[37]   0.07   1.00  13.87  0.01   0.02   1.10    1      1.82      0.44
x[38]   0.08   1.02  12.23  0.01   0.02   0.98    1      1.70      0.44
x[39]   0.10   1.01  11.69  0.01   0.02   0.90    1      1.93      0.42
x[40]   0.09   0.96  12.06  0.01   0.02   1.16    1      1.77      0.45
x[41]   0.07   0.96  12.87  0.00   0.02   0.94    1      2.16      0.41
x[42]   0.07   1.02  13.33  0.01   0.02   1.00    1      2.18      0.43
x[43]   0.08   0.97  10.25  0.01   0.02   0.65    1      2.19      0.49
x[44]   0.08   1.03  12.29  0.01   0.02   0.88    1      1.59      0.42
x[45]   0.08   0.98  12.22  0.01   0.02   1.07    1      2.11      0.41
x[46]   0.08   0.99  12.06  0.01   0.02   0.99    1      2.05      0.45
x[47]   0.08   1.01  14.50  0.01   0.02   1.38    1      2.39      0.39
x[48]   0.08   1.01  13.05  0.01   0.01   1.18    1      2.10      0.43
x[49]   0.08   1.00  12.93  0.01   0.02   1.10    1      2.00      0.39
x[50]   0.08   1.00  13.22  0.01   0.02   1.08    1      1.88      0.43
I       0.00   0.00   1.00  0.00   0.00     NA    1      1.84      1.84
lp__  -80.63 -69.06 -59.31  0.29   0.22   0.30    1      0.30      0.55

For each parameter, Bulk_Reff and Tail_Reff are crude measures of relative
effective sample size for bulk and tail quantities respectively (good mixing
Reff > 0.1), and Rhat is the potential scale reduction factor on rank normalized
split chains (at convergence, Rhat = 1).

All Rhat < 1.01 and Reff > 0.1 indicate good performance of the sampler. We see that the Stan’s automatic (and implicit) transformation of constraint parameters can have a big effect on the sampling performance. More experimentes with different parameterizations of the half-Cauchy distribution can be found in rhat_reff_extra.Rmd.

4.2 Hierarchical model: Eight Schools

The Eight Schools data is a classic example for hierarchical models (see Section 5.5 in Gelman et al., 2013), which despite the apparent simplicity nicely illustrates the typical problems in inference for hierarchical models. The Stan models below are from Michael Betancourt’s case study on Diagnosing Biased Inference with Divergences.

4.2.1 A Centered Eight Schools model

writeLines(readLines("eight_schools_cp.stan"))
data {
  int<lower=0> J;
  real y[J];
  real<lower=0> sigma[J];
}

parameters {
  real mu;
  real<lower=0> tau;
  real theta[J];
}

model {
  mu ~ normal(0, 5);
  tau ~ cauchy(0, 5);
  theta ~ normal(mu, tau);
  y ~ normal(theta, sigma);
}

4.2.1.1 Centered Eight Schools model

We directly run the centered parameterization model with an increased adapt_delta value to reduce the probability of getting divergent transitions.

input_data <- read_rdump("eight_schools.data.R")
fit_cp <- stan(
  file = 'eight_schools_cp.stan', data = input_data,
  iter = 2000, chains = 4, seed = 483892929, refresh = 0,
  control = list(adapt_delta = 0.95)
)
Warning: There were 113 divergent transitions after warmup. Increasing adapt_delta above 0.95 may help. See
http://mc-stan.org/misc/warnings.html#divergent-transitions-after-warmup
Warning: There were 2 chains where the estimated Bayesian Fraction of Missing Information was low. See
http://mc-stan.org/misc/warnings.html#bfmi-low
Warning: Examine the pairs() plot to diagnose sampling problems

Despite an increased adapt_delta, we still observe a lot of divergent transitions, which in itself is already sufficient to not trust the results. Still, we can use Rhat and Reff diagnostics to recognize problematic parts of the posterior.

mon <- monitor(fit_cp)
print(mon)
Inference for the input samples (4 chains: each with iter = 2000; warmup = 1000):

             Q5    Q50   Q95 SE_Q5 SE_Q50 SE_Q95 Rhat Bulk_Reff Tail_Reff
mu        -1.11   4.53  9.90  0.24   0.20   0.24 1.02      0.14      0.16
tau        0.39   2.85  9.61  0.05   0.29   0.38 1.07      0.02      0.14
theta[1]  -2.24   5.81 16.29  0.29   0.28   0.62 1.02      0.19      0.08
theta[2]  -2.60   5.07 13.43  0.35   0.27   0.42 1.01      0.24      0.13
theta[3]  -5.01   4.35 12.11  0.49   0.27   0.47 1.01      0.22      0.11
theta[4]  -2.86   5.00 12.82  0.27   0.24   0.33 1.01      0.25      0.14
theta[5]  -4.74   4.03 10.83  0.37   0.28   0.33 1.01      0.18      0.25
theta[6]  -4.15   4.28 11.60  0.41   0.26   0.35 1.01      0.21      0.20
theta[7]  -1.30   5.97 15.59  0.34   0.27   0.37 1.02      0.15      0.11
theta[8]  -3.37   5.12 13.84  0.35   0.28   0.40 1.01      0.23      0.10
lp__     -24.68 -15.04  0.37  0.35   0.73   1.43 1.07      0.02      0.03

For each parameter, Bulk_Reff and Tail_Reff are crude measures of relative
effective sample size for bulk and tail quantities respectively (good mixing
Reff > 0.1), and Rhat is the potential scale reduction factor on rank normalized
split chains (at convergence, Rhat = 1).

See the extended case study rhat_reff_extra for results of longer chains.

Bulk-\(R_{\rm eff}\) for the between school standard deviation tau is 0.05<0.01, indicating we should investigate that parameter more carefully. We thus examine the relative efficiency in different parts of the posterior by computing the relative efficiency of small interval estimates for tau. These plots may either show quantiles or parameter values at the vertical axis. Red ticks show divergent transitions.

plot_local_reff(fit = fit_cp, par = "tau", nalpha = 20)

plot_local_reff(fit = fit_cp, par = "tau", nalpha = 20, rank = FALSE)

We see that the sampler has difficulties in exploring small tau values. As the efficiency for estimating small tau values is practically zero, we may assume that we may miss substantial amount of posterior mass and get biased estimates. Red ticks, which show iterations with divergences, have concentrated to small tau values, indicate also problems exploring small values which is likely to cause bias.

We examine also the relative efficiency of different quantile estimates. Again, these plots may either show quantiles or parameter values at the vertical axis.

plot_quantile_reff(fit = fit_cp, par = 2, nalpha = 40)

plot_quantile_reff(fit = fit_cp, par = 2, nalpha = 40, rank = FALSE)

Most of the quantile estimates have worryingly low relative efficiency.

In lines with these findings, the rank plots of tau clearly show problems in the mixing of the chains.

samp_cp <- as.array(fit_cp)
mcmc_hist_r_scale(samp_cp[, , "tau"])

4.2.2 Non-centered Eight Schools model

For hierarchical models, the non-centered parameterization often works better than the centered one:

writeLines(readLines("eight_schools_ncp.stan"))
data {
  int<lower=0> J;
  real y[J];
  real<lower=0> sigma[J];
}

parameters {
  real mu;
  real<lower=0> tau;
  real theta_tilde[J];
}

transformed parameters {
  real theta[J];
  for (j in 1:J)
    theta[j] = mu + tau * theta_tilde[j];
}

model {
  mu ~ normal(0, 5);
  tau ~ cauchy(0, 5);
  theta_tilde ~ normal(0, 1);
  y ~ normal(theta, sigma);
}

For reasons of comparability, we also run the non-centered parameterization model with an increased adapt_delta value:

fit_ncp2 <- stan(
  file = 'eight_schools_ncp.stan', data = input_data,
  iter = 2000, chains = 4, seed = 483892929, refresh = 0,
  control = list(adapt_delta = 0.95)
)

We get zero divergences and no other warnings which is a first good sign.

mon <- monitor(fit_ncp2)
print(mon)
Inference for the input samples (4 chains: each with iter = 2000; warmup = 1000):

                   Q5   Q50   Q95 SE_Q5 SE_Q50 SE_Q95 Rhat Bulk_Reff Tail_Reff
mu              -1.14  4.42  9.96  0.13   0.05   0.12    1      1.38      0.52
tau              0.30  2.81  9.50  0.03   0.05   0.14    1      0.72      0.84
theta_tilde[1]  -1.29  0.31  1.88  0.05   0.01   0.04    1      1.26      0.46
theta_tilde[2]  -1.44  0.11  1.62  0.04   0.02   0.04    1      1.04      0.50
theta_tilde[3]  -1.64 -0.08  1.48  0.03   0.02   0.03    1      1.62      0.52
theta_tilde[4]  -1.51  0.08  1.62  0.03   0.01   0.04    1      1.52      0.47
theta_tilde[5]  -1.71 -0.17  1.39  0.04   0.01   0.03    1      1.40      0.54
theta_tilde[6]  -1.64 -0.06  1.53  0.04   0.02   0.03    1      1.21      0.47
theta_tilde[7]  -1.25  0.40  1.87  0.04   0.01   0.03    1      1.20      0.53
theta_tilde[8]  -1.54  0.07  1.66  0.04   0.01   0.04    1      1.54      0.49
theta[1]        -1.62  5.64 16.31  0.17   0.11   0.50    1      1.23      0.64
theta[2]        -2.42  4.82 12.96  0.23   0.08   0.35    1      1.28      0.63
theta[3]        -4.63  4.26 12.10  0.23   0.05   0.17    1      1.36      0.70
theta[4]        -2.73  4.75 12.48  0.16   0.07   0.19    1      1.17      0.64
theta[5]        -3.96  3.82 11.02  0.19   0.06   0.15    1      1.34      0.68
theta[6]        -3.88  4.27 11.44  0.18   0.07   0.17    1      1.42      0.60
theta[7]        -0.98  5.88 15.43  0.11   0.09   0.25    1      1.18      0.66
theta[8]        -3.23  4.78 13.13  0.23   0.08   0.24    1      1.23      0.64
lp__           -11.19 -6.60 -3.78  0.12   0.05   0.06    1      0.41      0.65

For each parameter, Bulk_Reff and Tail_Reff are crude measures of relative
effective sample size for bulk and tail quantities respectively (good mixing
Reff > 0.1), and Rhat is the potential scale reduction factor on rank normalized
split chains (at convergence, Rhat = 1).

All Rhat < 1.01 and Rhat > 0.1 indicate a much better performance of the non-centered parameterization.

We examine the relative efficiency in different parts of the posterior by computing the relative efficiency of small interval probability estimates for tau.

plot_local_reff(fit = fit_ncp2, par = 2, nalpha = 20)

Small tau values are still more difficult to explore, but the relative efficiency is in a good range. We may also check this with a finer resolution:

plot_local_reff(fit = fit_ncp2, par = 2, nalpha = 50)

The relative efficiency of different quantile estimates looks good as well.

plot_quantile_reff(fit = fit_ncp2, par = 2, nalpha = 40)

In line with these finding, the rank plots of tau show no substantial differences between chains.

samp_ncp2 <- as.array(fit_ncp2)
mcmc_hist_r_scale(samp_ncp2[, , 2])

5 Appendix

5.1 Appendix A: Abbreviations

The following abbreviations are used throughout the appendices:

  • Rhat = classic no-split-Rhat
  • sRhat = classic split-Rhat
  • zsRhat = rank-normalized split-Rhat
    • all chains are jointly ranked and z-transformed
    • can detect differences in location and trends
  • zfsRhat = rank-normalized folded split-Rhat
    • all chains are jointly “folded” by computing absolute deviation from median, ranked and z-transformed
    • can detect differences in scales
  • neff = no-split effective sample size
  • reff = neff / N
  • zsneff = rank-normalized split effective sample size
    • estimates the efficiency of mean estimate for rank normalized values
  • zsreff = zsneff / N
  • zfsneff = rank-normalized folded split effective sample size
    • estimates the efficiency of rank normalized mean absolute deviation
  • zfsreff = zfsneff / N
  • medsneff = median split effective sample size
    • estimates the efficiency of median
    • approximated using efficiency for \(p(\theta < \hat{Q}_{0.5})\)
  • medsreff = medsneff / N
  • madsneff = mad split effective sample size
    • estimates the efficiency of median absolute deviation
  • madsreff = madsneff / N

5.2 Appendix B: Examples of rank normalization

We illustrate rank normalization with a few examples. We first plot histograms, and empirical cumulative distribution functions (ECDF) with respect to original parameter values (\(\theta\)), scaled ranks (ranks divided by the max rank), and rank normalized values (z). We used scaled ranks to make the plots look similar for different number of draws.

100 draws from N(0,1):

n <- 100
theta <- rnorm(n)
plotranknorm(theta, n)

100 draws from exp(1):

theta <- rexp(n)
plotranknorm(theta, n)

100 draws from Cauchy(0, 1):

theta <- rcauchy(n)
plotranknorm(theta, n)

In the above plots, the ECDF with respect to scaled rank and rank normalized \(z\)-values look all exatly same for all distributions. In Split-\(\widehat{R}\) and effective sample size computations, we rank all draws jointly, but then compare ranks and ECDF of individual split chains. To illustrate the variation between chains, we draw 8 batches of 100 draws each from N(0,1):

n <- 100
m <- 8
theta <- rnorm(n*m)
plotranknorm(theta, n, m)

The variation in ECDF due to the variation ranks is now visible also in scale ranks and rank normalized z-values from different batches.

The benefit of rank normalization is more obvious for non-normal distribution such as Cauchy:

theta <- rcauchy(n*m)
plotranknorm(theta, n, m)

Rank normalization makes the subsequent computations well defined and invariant under bijective transformations. This means that we get the same results, for example, if we use unconstrained or constrained parameterisation in a model.

5.3 Appendix C: Variance of the cumulative distribution function

In Section 3, we had defined the empirical CDF (ECD) for any \(\theta_\alpha\) as \[ p(\theta \leq \theta_\alpha) \approx \bar{I}_\alpha = \frac{1}{S}\sum_{s=1}^S I(\theta^{(s)} \leq\theta_\alpha), \]

For independent draws \(\bar{I}_\alpha\) has a \({\rm Beta}(\bar{I}_\alpha+1, S - \bar{I}_\alpha + 1)\) distribution. Thus we can easily examine variation of ECDF for any \(\theta_\alpha\) value from a single chain. If \(\bar{I}_\alpha\) is not very close to \(1\) or \(S\) and \(S\) is large, we can use the variance of Beta distribution

\[ {\rm Var}[p(\theta \leq \theta_\alpha)] = \frac{(\bar{I}_\alpha+1)*(S-\bar{I}_\alpha+1)}{(S+2)^2(S+3)}. \] We illustrate uncertainty intervals of Beta distribution and normal approximation of ECDF for 100 draws from N(0,1):

n <- 100
m <- 1
theta <- rnorm(n*m)
plotranknorm(theta, n, m, interval = TRUE)

Uncertainty intervals of ECDF for draws from Cauchy(0,1) illustrate again the improved visual clarity in plotting when using scaled ranks:

n <- 100
m <- 1
theta <- rcauchy(n*m)
plotranknorm(theta, n, m, interval = TRUE)

The above plots illustrate that the normal approximation is accurate for practical purposes in MCMC diagnostics.

5.4 Appendix D: Normal distributions with additional trend, shift or scaling

This part focuses on diagnostics for

  • all chains have a trend and similar marginal distribution
  • one of the chains has different mean
  • one of the chains has lower marginal variance

To simplify, in this part independent draws are used as a proxy for very efficient MCMC and we sample draws from a standard-normal distribution. We will discuss the behavior for non-Gaussian distributions later.

5.4.1 Trend: All chains from the same \(N(0,1)\) distribution plus linear trend added to all chains.

conds <- expand.grid(
  iters = c(250, 1000, 4000), 
  trend = c(0, 0.25, 0.5, 0.75, 1),
  rep = 1:10
)
res <- vector("list", nrow(conds))
chains = 4
for (i in 1:nrow(conds)) {
  iters <- conds[i, "iters"]
  trend <- conds[i, "trend"]
  rep <- conds[i, "rep"]
  r <- array(rnorm(iters * chains), c(iters, chains))
  r <- r + seq(-trend, trend, length.out = iters)
  rs <- as.data.frame(monitor_extra(r))
  res[[i]] <- cbind(iters, trend, rep, rs)
}
res <- bind_rows(res)

If we don’t split chains, Rhat misses the trends if all chains still have similar marginal distribution.

ggplot(data=res, aes(y=Rhat, x=trend)) + 
  geom_point() + 
  geom_jitter() + 
  facet_grid(. ~ iters) + 
  geom_hline(yintercept = 1.005, linetype = 'dashed') + 
  geom_hline(yintercept = 1) + 
  ggtitle('Rhat without splitting chains')

Split-Rhat can detect trends, even if the marginals of the chains are similar.

ggplot(data=res, aes(y=zsRhat, x=trend)) + 
  geom_point() + geom_jitter() + 
  facet_grid(. ~ iters) + 
  geom_hline(yintercept = 1.005, linetype = 'dashed') + 
  geom_hline(yintercept = 1) + 
  ggtitle('Split-Rhat')

Result: Split-Rhat is useful for detecting non-stationarity in the chains. If we use a threshold of \(1.01\), we can detect trends which have account for 2% of the total marginal variance. If we use a threshold of \(1.1\), we can detect trends which account for 30% of the total marginal variance.

Relative efficiency (effective sample size divided by the number of draws) is based on split Rhat and within-chain autocorrelation.

ggplot(data=res, aes(y=zsreff, x=trend)) + 
  geom_point() + 
  geom_jitter() + 
  facet_grid(. ~ iters) + 
  geom_hline(yintercept = c(0,1)) + 
  geom_hline(yintercept = 0.1, linetype = 'dashed') + 
  ggtitle('Relative efficiency (zsreff)') + 
  scale_y_continuous(breaks = seq(0,1,by=0.25))

Result: Split-Rhat is more sensitive to trends in small sample sizes, but relative efficiency is more sensitive with larger samples sizes (as autocorrelations can be estimated more accurately).

Advice: If in doubt, run longer chains for more accurate estimates.

5.4.2 Shifted: Three chains from \(N(0,1)\) and one chain from \(N(a,1)\)

Next we investigate the sensitivity to detect if one of the chains has not converged to the same distribution as the others, but has a different mean.

conds <- expand.grid(
  iters = c(250, 1000, 4000), 
  shift = c(0, 0.25, 0.5, 0.75, 1),
  rep = 1:10
)
res <- vector("list", nrow(conds))
chains = 4
for (i in 1:nrow(conds)) {
  iters <- conds[i, "iters"]
  shift <- conds[i, "shift"]
  rep <- conds[i, "rep"]
  r <- array(rnorm(iters * chains), c(iters, chains))
  r[, 1] <- r[, 1] + shift
  rs <- as.data.frame(monitor_extra(r))
  res[[i]] <- cbind(iters, shift, rep, rs)
}
res <- bind_rows(res)
ggplot(data=res, aes(y=zsRhat, x=shift)) + 
  geom_point() + 
  geom_jitter() + 
  facet_grid(. ~ iters) + 
  geom_hline(yintercept = 1.005, linetype = 'dashed') + 
  geom_hline(yintercept = 1) + 
  ggtitle('Split-Rhat')

Result: If we use a threshold of \(1.01\), we can detect shift with a magnitude of one third of the marginal standard deviation. If we use a threshold of \(1.1\), we can detect shift with a magnitude equal to the marginal standard deviation. The rank plot can be used to visualize where the problem is.

ggplot(data=res, aes(y=zsreff, x=shift)) + 
  geom_point() +
  geom_jitter() + 
  facet_grid(. ~ iters) + 
  geom_hline(yintercept = c(0,1)) + 
  geom_hline(yintercept = 0.1, linetype = 'dashed') + 
  ggtitle('Relative efficiency (zsreff)') + 
  scale_y_continuous(breaks = seq(0,1,by=0.25))

Result: The relative efficiency is not as sensitive, but a shift with a magnitude of half the marginal standard deviation will lead to very low efficiency when sample size increases.

Rank plot visualisation for a case with 4 chains and 250 draws per chain, and shift = 0.5.

iters = 250
chains = 4
shift = 0.5
r <- array(rnorm(iters * chains), c(iters, chains))
r[,1] <- r[,1] + shift
colnames(r) <- 1:4
mcmc_hist_r_scale(r)

Rhat is less than \(1.05\), but the rank plot clearly shows the location of the problem (first chain).

5.4.3 Scaled: Three chains from \(N(0,1)\) and one chain from \(N(0,a)\)

Next we investigate the sensitivity to detect if one of the chains has not converged to the same distribution as the others but has lower marginal variance.

conds <- expand.grid(
  iters = c(250, 1000, 4000), 
  scale = c(0, 0.25, 0.5, 0.75, 1),
  rep = 1:10
)
res <- vector("list", nrow(conds))
chains = 4
for (i in 1:nrow(conds)) {
  iters <- conds[i, "iters"]
  scale <- conds[i, "scale"]
  rep <- conds[i, "rep"]
  r <- array(rnorm(iters * chains), c(iters, chains))
  r[, 1] <- r[, 1] * scale
  rs <- as.data.frame(monitor_extra(r))
  res[[i]] <- cbind(iters, scale, rep, rs)
}
res <- bind_rows(res)
ggplot(data = res, aes(y = zsRhat, x = scale)) + 
  geom_point() + 
  geom_jitter() + 
  facet_grid(. ~ iters) + 
  geom_hline(yintercept = 1.005, linetype = 'dashed') + 
  geom_hline(yintercept = 1) + 
  ggtitle('Split-Rhat')

Result: Split-Rhat is not able to detect scale differences between chains.

ggplot(data = res, aes(y = zfsRhat, x = scale)) + 
  geom_point() + 
  geom_jitter() + 
  facet_grid(. ~ iters) + 
  geom_hline(yintercept = 1.005, linetype = 'dashed') + 
  geom_hline(yintercept = 1) + 
  ggtitle('Folded-split-Rhat')

Result: Folded-Split-Rhat focuses explicitly on scales and can thus detect scale differences.

Result: If we use a threshold of \(1.01\), we can detect a chain with scale less than \(3/4\) of the standard deviation of the others. If we use threshold of \(1.1\), we can detect a chain with standard deviation less than \(1/4\) of the standard deviation of the others.

ggplot(data=res, aes(y=zsreff, x=scale)) + 
  geom_point() + 
  geom_jitter() + 
  facet_grid(. ~ iters) + 
  geom_hline(yintercept = c(0,1)) + 
  geom_hline(yintercept = 0.1, linetype = 'dashed') + 
  ggtitle('Relative efficiency (zsreff)') + 
  scale_y_continuous(breaks = seq(0,1,by=0.25))

Result: The relative efficiency does not see a problem as it focuses on location differences between chains.

ggplot(data=res, aes(y=zfsreff, x=scale)) + 
  geom_point() + 
  geom_jitter() + 
  facet_grid(. ~ iters) + 
  geom_hline(yintercept = c(0,1)) + 
  geom_hline(yintercept = 0.1, linetype = 'dashed') + 
  ggtitle('Folded relative efficiency (zfsreff)') + 
  scale_y_continuous(breaks = seq(0,1,by=0.25))

Result: The relative efficiency of the standard deviation does see the problem as it focuses on scale.

Rank plot visualisation for a case with 4 chains and 250 draws per chain, and with one chain having a standard deviation of 0.75 as opposed to a standard deviation of 1 for the other chains.

iters = 250
chains = 4
scale = 0.75
r <- array(rnorm(iters * chains), c(iters, chains))
r[, 1] <- r[, 1] * scale
colnames(r) <- 1:4
mcmc_hist_r_scale(r)

Folded Rhat is \(1.06\), but the rank plot clearly shows where the problem is.

5.5 Appendix E: Cauchy: A distribution with infinite mean and variance

The classic split-Rhat is based on calculating within and between chain variances. If the marginal distribution of a chain is such that the variance is not defined (i.e. infinite), the classic split-Rhat is not well justified. In this section, we will use the Cauchy distribution as an example of such distribution. Also in cases where mean and variance are finite, the distribution can be far from Gaussian. Especially distributions with very long tails cause instability for variance and autocorrelation estimates. To alleviate these problems we will use Split-Rhat for rank-normalized draws.

The following Cauchy models are from Michael Betancourt’s case study Fitting The Cauchy Distribution

5.5.1 Nominal parameterization of Cauchy

The nominal Cauchy model with direct parameterization is as follows.

writeLines(readLines("cauchy_nom.stan"))
parameters {
  vector[50] x;
}

model {
  x ~ cauchy(0, 1);
}

generated quantities {
  real I = fabs(x[1]) < 1 ? 1 : 0;
}

5.5.1.1 Default Stan options

Run the nominal model:

fit_nom <- stan(file='cauchy_nom.stan', seed=7878, refresh = 0)
Warning: There were 1233 transitions after warmup that exceeded the maximum treedepth. Increase max_treedepth above 10. See
http://mc-stan.org/misc/warnings.html#maximum-treedepth-exceeded
Warning: Examine the pairs() plot to diagnose sampling problems

Treedepth exceedence and Bayesian Fraction of Missing Information are dynamic HMC specific diagnostics (Betancourt, 2017). We get warnings about very large number of transitions after warmup that exceeded the maximum treedepth, which is likely due to very long tails of the Cauchy distribution. All chains have low estimated Bayesian fraction of missing information also indicating slow mixing.

MCMC trace for the first parameter looks wild with occasional large values

samp <- as.array(fit_nom) 
mcmc_trace(samp[, , 1])

Let’s check Rhat and relative efficiency diagnostics.

res <- monitor_extra(samp[, , 1:50])
which_min_eff <- which(res$zsreff == min(res$zsreff))
plot_rhat(res)

For one parameter, Rhats exceed the classic threshold of 1.1. Depending on the Rhat estimate, a few others also exceed the threshold 1.01. The rank normalized split-Rhat have several values over 1.01. Please note that the classic split-Rhat is not well defined in this example.

plot_reff(res) 

Both classic and new relative efficiency estimates have several near zero values, and the overall sample shouldn’t be trusted.

Result: Relative efficiency is more sensitive than (rank-normalized) split-Rhat to detect problems of slow mixing.

We also check lp__ and find out that the relative efficiency is worryingly low.

res <- monitor_extra(samp[, , 51:52]) 
cat('lp__: Bulk-R_eff = ', round(res['lp__', 'zsreff'], 2), '\n')
lp__: Bulk-R_eff =  0.03 
cat('lp__: Tail-R_eff = ', round(res['lp__', 'zfsreff'], 2), '\n')
lp__: Tail-R_eff =  0.12 

We can further analyze potential problems using local relative efficiency and rank plots. We examine x[36], which has the smallest bulk relative efficiency 0.03.

We examine the relative efficiency in different parts of the posterior by computing the relative efficiency of small interval probability estimates (see Section Relative efficiency of small interval probability estimates). Each interval contains \(1/k\) of the draws (e.g., with \(k=20\)). The small interval efficiency measures mixing of an indicator function which indicates when the values are inside the specific small interval. This gives us a local efficiency measure which does not depend on the shape of the distribution.

plot_local_reff(fit = fit_nom, par = which_min_eff, nalpha = 20)

We see that the efficiency of MCMC is worryingly low in the tails (which is caused by slow mixing in long tails of Cauchy). Orange ticks show chains with maximum treedepth.

Alternative way to examine the relative efficiency in different parts of the posterior, is to compute relative efficiency for quantiles (see Section Relative efficiency of quantiles). Each interval has specified proportion of draws, and the efficiency measures mixing of an indicator function which indicates when the values are inside the specific interval.

plot_quantile_reff(fit = fit_nom, par = which_min_eff, nalpha = 40)

We see that the efficiency of MCMC is worryingly low in the tails (which is caused by slow mixing in long tails of Cauchy). Orange ticks show chains with maximum treedepth.

We can further analyze potential problems using rank plots.

xmin <- paste0("x[", which_min_eff, "]")
mcmc_hist_r_scale(samp[, , xmin])

The chains clearly have different rank plots.

5.5.1.2 Default Stan options + max_treedepth=20

We can try to improve the performance by increasing max_treedepth.

fit_nom_td20 <- stan(
  file='cauchy_nom.stan', seed=7878, 
  refresh = 0, control = list(max_treedepth=20)
)

MCMC trace for the first parameter looks wild with occasional large values.

samp <- as.array(fit_nom_td20)
mcmc_trace(samp[, , 1])

res <- monitor_extra(samp[, , 1:50])
which_min_eff <- which(res$zsreff == min(res$zsreff))

We check the diagnostics for all x’s.

plot_rhat(res)

All Rhats are below \(1.1\), but many are over \(1.01\). Classic split-Rhat has more variation than the rank normalized (and the former is not well defined). The folded rank normalized Rhat shows that there is still more variation in scale than location between different chains.

plot_reff(res) 

Some classic R_eff’s are close to zero. If we wouldn’t realize that the variance is infinite, we might try to run longer chains, but in case of infinite variance, zero efficiency is the truth and longer chains won’t help. However other quantities can be well defined, and that’s why it is useful also to look at the rank normalized version as a generic transformation with finite mean and variance. The smallest bulk-\(R_{\rm eff}\) are around \(0.25\), which is not that bad. The smallest median-\(R_{\rm eff}\)’s are larger than \(0.5\), that is we are able to estimate the median efficiently. However, many tail-\(R_{\rm eff}\)’s are small indicating problmes for estimating scale.

Result: Rank normalized relative efficiency is more stable than classic relative efficiency, which is not well defined for Cauchy.

Result: it is useful to look at both bulk and tail relative efficiencies.

We check also lp__. Although increasing max_treedepth improved efficiency for bulk of x, the efficiency for lp__ didn’t change.

res <- monitor_extra(samp[, , 51:52])
cat('lp__: Bulk-R_eff = ', round(res['lp__','zsreff'],2), '\n')
lp__: Bulk-R_eff =  0.06 
cat('lp__: Tail-R_eff = ', round(res['lp__','zfsreff'],2), '\n')
lp__: Tail-R_eff =  0.18 

We examine the relative efficiency in different parts of the posterior by computing the relative efficiency of small interval probability estimates.

plot_local_reff(fit = fit_nom_td20, par = which_min_eff, nalpha = 20)

It seems that increasing max_treedepth has not much improved the efficiency in the tails.

We examine also the relative efficiency of different quantile estimates.

plot_quantile_reff(fit = fit_nom_td20, par = which_min_eff, nalpha = 40)

Rank plot visualisation of x[11], which has the smallest relative efficiency 0 among the x’s. The rank plot indicates clear convergence problems.

xmin <- paste0("x[", which_min_eff, "]")
mcmc_hist_r_scale(samp[, , xmin])

The rank plot visualisation of lp__, which has relative efficiency 0, doesn’t look so good either.

mcmc_hist_r_scale(samp[, , "lp__"])

5.5.1.3 Default Stan options + max_treedepth = 20 + 8 times longer sampling

Let’s try running longer chains.

fit_nom_td20l <- stan(
  file='cauchy_nom.stan', seed=7878, 
  refresh = 0, control = list(max_treedepth=20), 
  warmup=1000, iter = 9000
)
Warning: There were 7 transitions after warmup that exceeded the maximum treedepth. Increase max_treedepth above 20. See
http://mc-stan.org/misc/warnings.html#maximum-treedepth-exceeded
Warning: Examine the pairs() plot to diagnose sampling problems

MCMC trace for the first parameter looks wild with occasional large values

samp <- as.array(fit_nom_td20l)
mcmc_trace(samp[, , 1])

res <- monitor_extra(samp[, , 1:50])
which_min_eff <- which(res$zsreff == min(res$zsreff))

Check the diagnostics for all x’s.

plot_rhat(res)

All Rhats are below \(1.01\). Classic split-Rhat has more variation than the rank normalized Rhat (but the former is not well defined).

plot_reff(res) 

Most classic R_eff’s are close to zero. Running longer chains just made most classic R_eff’s smaller.

The smallest bulk-\(R_{\rm eff}\) are around \(0.25\), which is not that bad. The smallest median-\(R_{\rm eff}\)’s are larger than \(0.75\), that is we are able to estimate the median efficiently. However, many tail-\(R_{\rm eff}\)’s are small indicating problmes for estimating scale.

Result: Rank normalized relative efficiency is more stable than classic relative efficiency, which is not well defined for Cauchy.

Result: it is useful to look at both bulk and tail relative efficiencies.

We check also lp__. Although increasing the number of iterations improved efficiency for bulk of x, the efficiency for lp__ didn’t change.

res <- monitor_extra(samp[, , 51:52])
cat('lp__: Bulk-R_eff = ', round(res['lp__','zsreff'],2), '\n')
lp__: Bulk-R_eff =  0.04 
cat('lp__: Tail-R_eff = ', round(res['lp__','zfsreff'],2), '\n')
lp__: Tail-R_eff =  0.1 

We examine the relative efficiency in different parts of the posterior by computing the relative efficiency of small interval probability estimates.

plot_local_reff(fit = fit_nom_td20l, par = which_min_eff, nalpha = 20)

Increasing chain length did not seem to change the relative efficiency. With more draws from the longer chains we can use finer resolution for the local efficiency estimates.

plot_local_reff(fit = fit_nom_td20l, par = which_min_eff, nalpha = 100)

The efficiency far in the tails is worryingly low. This was more difficult to see previously with less draws from the tails. We see similar problems in the plot of relative efficiency of quantiles.

plot_quantile_reff(fit = fit_nom_td20l, par = which_min_eff, nalpha = 100)

The rank plot visualisation of x[39], which has the smallest relative efficiency 0.04 among the x’s.

xmin <- paste0("x[", which_min_eff, "]")
mcmc_hist_r_scale(samp[, , xmin])

Increasing the number of iterations couldn’t remove the mixing problems at the edges. The mixing problem is inherent to Cauchy and nominal parameterization.

The rank plot visualisation of lp__, which has relative efficiency 0.04, looks as follows:

mcmc_hist_r_scale(samp[, , "lp__"])

There seems to be slight problems also in mixing for energy.

5.5.2 First alternative parameterization of Cauchy

Next we examine alternative parameterization and consider Cauchy as a scale mixture of Gaussian distributions. The model has two parameters and the Cauchy distributed x’s can be computed from those. In addition of improved sampling performance, the example illustrates that focus of diagnostics matter.

writeLines(readLines("cauchy_alt_1.stan"))
parameters {
  vector[50] x_a;
  vector<lower=0>[50] x_b;
}

transformed parameters {
  vector[50] x = x_a ./ sqrt(x_b);
}

model {
  x_a ~ normal(0, 1);
  x_b ~ gamma(0.5, 0.5);
}

generated quantities {
  real I = fabs(x[1]) < 1 ? 1 : 0;
}

Run the alternative model:

fit_alt1 <- stan(file='cauchy_alt_1.stan', seed=7878, refresh = 0)

There are no warnings, and the sampling is much faster.

samp <- as.array(fit_alt1)
res <- monitor_extra(samp[, , 101:150])
which_min_eff <- which(res$zsreff == min(res$zsreff))
plot_rhat(res)

All Rhats are below 1.01. Classic split-Rhat’s look also good even if the classic is not well defined for Cauchy distribution.

plot_reff(res) 

Result: Rank normalized R_eff’s have less variation than classic one which is not well defined for Cauchy.

We check lp__:

res <- monitor_extra(samp[, , 151:152])
cat('lp__: Bulk-R_eff = ', round(res['lp__','zsreff'],2), '\n')
lp__: Bulk-R_eff =  0.33 
cat('lp__: Tail-R_eff = ', round(res['lp__','zfsreff'],2), '\n')
lp__: Tail-R_eff =  0.59 

The relative efficiencies for lp__ are also much better than with nominal parameterization.

We examine the relative efficiency in different parts of the posterior by computing the relative efficiency of small interval probability estimates.

plot_local_reff(fit = fit_alt1, par = 100+which_min_eff, nalpha = 20)

The relative efficiency is good in all parts of the posterior.

We examine also the relative efficiency of different quantile estimates.

plot_quantile_reff(fit = fit_alt1, par = 100+which_min_eff, nalpha = 40)

We compare mean relative efficiencies of underlying parameters in the new parameterization and the actual x we are interested in.

res <- monitor_extra(samp[, , 101:150])
res1 <- monitor_extra(samp[, , 1:50])
res2 <- monitor_extra(samp[, , 51:100])
cat('Mean bulk-R_eff for x\'s = ' , round(mean(res[,'zsreff']), 2), '\n')
Mean bulk-R_eff for x's =  0.91 
cat('Mean tail-R_eff for x\'s = ' , round(mean(res[,'zfsreff']), 2), '\n')
Mean tail-R_eff for x's =  0.54 
cat('Mean bulk-R_eff for x_a\'s = ' , round(mean(res1[,'zsreff']), 2), '\n')
Mean bulk-R_eff for x_a's =  0.99 
cat('Mean bulk-R_eff for x_b\'s = ' , round(mean(res2[,'zsreff']), 2), '\n')
Mean bulk-R_eff for x_b's =  0.69 

Result: We see that the relative efficiency of the interesting \(x\) can be different from the relative efficiency of the parameters \(x_a\) and \(x_b\) that we use to compute it.

Rank plot visualisation of x[10], which has the smallest relative efficiency 0.72 among x’s.

xmin <- paste0("x[", which_min_eff, "]")
mcmc_hist_r_scale(samp[, , xmin])

Looks better than for the nominal parameterization.

Rank plot visualisation of lp__, which has a relative efficiency of -81.34, 0.23, 8.08, -95.19, -80.99, -68.66, 1288.4, 0.32, 1302.86, 1295.92, 1310.35, 0.33, 1, 1, 1, 1, 1, 2365.93, 0.59, 1708.03, 0.43, 2912.14, 0.73.

mcmc_hist_r_scale(samp[, , "lp__"])

Looks better than for the nominal parameterization.

5.5.3 Another alternative parameterization of the Cauchy distribution

Another alternative parameterization is univariate transformation as shown in the following code (3rd alternative in Michael Betancourt’s case study).

writeLines(readLines("cauchy_alt_3.stan"))
parameters {
  vector<lower=0, upper=1>[50] x_tilde;
}

transformed parameters {
vector[50] x = tan(pi() * (x_tilde - 0.5));
}

model {
  // Implicit uniform prior on x_tilde
}

generated quantities {
  real I = fabs(x[1]) < 1 ? 1 : 0;
}

Run the alternative model:

fit_alt3 <- stan(file='cauchy_alt_3.stan', seed=7878, refresh = 0)

There are no warnings, and the sampling is much faster than for the nominal model.

samp <- as.array(fit_alt3)
res <- monitor_extra(samp[, , 51:100])
which_min_eff <- which(res$zsreff == min(res$zsreff))
plot_rhat(res)

All Rhats except some folded Rhats are below 1.01. Classic split-Rhat’s look also good even if it is not well defined for the Cauchy distribution.

plot_reff(res) 

Result: Rank normalized relative efficiencies have less variation than classic ones which is not well defined for Cauchy. Bulk-\(R_{\rm eff}\) and median-\(R_{\rm eff}\) are slightly larger than 1, which is possible for antithetic Markov chains which have negative correlation for odd lags.

We check lp__:

res <- monitor_extra(samp[, , 101:102])
cat('lp__: Bulk-R_eff = ', round(res['lp__','zsreff'],2), '\n')
lp__: Bulk-R_eff =  0.37 
cat('lp__: Tail-R_eff = ', round(res['lp__','zfsreff'],2), '\n')
lp__: Tail-R_eff =  0.56 

The relative efficiency for these are also much better than with the nominal parameterization.

We examine the relative efficiency in different parts of the posterior by computing the relative efficiency of small interval probability estimates.

plot_local_reff(fit = fit_alt3, par = 50+which_min_eff, nalpha = 20)

We examine also the relative efficiency of different quantile estimates.

plot_quantile_reff(fit = fit_alt3, par = 50+which_min_eff, nalpha = 40)

The relative efficiency in tails is worse than for the first alternative parameterization, although it’s still better than for the nominal model.

We compare mean relative efficiency of underlying parameter in the new parameterization and the actual x we are interested in.

res <- monitor_extra(samp[, , 51:100])
res1 <- monitor_extra(samp[, , 1:50])
cat('Mean bulk-R_eff for x\'s = ' , round(mean(res[,'zsreff']),2), '\n')
Mean bulk-R_eff for x's =  1.18 
cat('Mean tail-R_eff for x\'s = ' , round(mean(res[,'zfsreff']),2), '\n')
Mean tail-R_eff for x's =  0.4 
cat('Mean bulk-R_eff for x_tilde\'s = ' , round(mean(res1[,'zsreff']),2), '\n')
Mean bulk-R_eff for x_tilde's =  1.18 

Result: When the transformation is univariate and bijective (x = tan(pi() * (x_tilde - 0.5))), the rank normalized Rhat and R_eff are transformation invariant.

Naturally when computing the desired expectations, the final target function should be taken into account.

Rank plot visualisation of x[45], which has the smallest relative efficiency of 0.99 among x’s.

xmin <- paste0("x[", which_min_eff, "]")
mcmc_hist_r_scale(samp[, , xmin])

Nothing special.

Rank plot visualisation of lp__, which has relative efficiency 0.33.

mcmc_hist_r_scale(samp[, , "lp__"])

Nothing special.

5.5.4 Half-Cauchy with nominal parameterization

Half-Cauchy priors are common and, for example, in Stan usually set using the nominal parameterization. However, when the constraint <lower=0> is used, Stan does the sampling automatically in the unconstrained log(x) space, which changes the geometry crucially.

writeLines(readLines("half_cauchy_nom.stan"))
parameters {
  vector<lower=0>[50] x;
}

model {
  x ~ cauchy(0, 1);
}

generated quantities {
  real I = fabs(x[1]) < 1 ? 1 : 0;
}

Run the half-Cauchy with nominal parameterization (and positive constraint).

fit_half_nom <- stan(file='half_cauchy_nom.stan', seed=7878, refresh = 0)

There are no warnings, and the sampling is much faster than for the Cauchy nominal model.

samp <- as.array(fit_half_nom)
res <- monitor_extra(samp[, , 1:50])
which_min_eff <- which(res$zsreff == min(res$zsreff))
plot_rhat(res) 

All Rhats are below 1.01. Classic split-Rhat’s look also good even if the classic is not well defined for half-Cauchy distribution.

plot_reff(res)  

Result: Rank normalized R_eff’s have less variation than classic one which is not well defined for Cauchy. Bulk-\(R_{\rm eff}\) and median-\(R_{\rm eff}\) are larger than 1, which is possible for antithetic Markov chains which have negative correlation for odd lags.

Due to constraint <lower=0>, the sampling was made in log(x) space, and we can also check the performance in that space.

res <- monitor_extra(log(samp[, , 1:50]))
plot_reff(res) 

\(\log(x)\) is quite close to Gaussian, and thus classic R_eff is also close to rank normalized R_eff which is exactly same as for \(x\) as the rank normalized version is invariant to bijective transformations.

Result: Rank normalized relative efficiency is close to classic relative efficiency for transformations which make the distribution close to Gaussian.

We check lp__:

res <- monitor_extra(samp[, , 51:52])
cat('lp__: Bulk-R_eff = ', round(res['lp__', 'zsreff'], 2), '\n')
lp__: Bulk-R_eff =  0.3 
cat('lp__: Tail-R_eff = ', round(res['lp__', 'zfsreff'], 2), '\n')
lp__: Tail-R_eff =  0.55 

We examine the relative efficiency in different parts of the posterior by computing the relative efficiency of small interval probability estimates.

plot_local_reff(fit = fit_half_nom, par = which_min_eff, nalpha = 20)

The relative efficiency is good overall, with only a small dip in tails.

We examine also the relative efficiency of different quantile estimates.

plot_quantile_reff(fit = fit_half_nom, par = which_min_eff, nalpha = 40)

Rank plot visualisation of x[29], which has the smallest relative efficiency of 0.3 among x’s.

xmin <- paste0("x[", which_min_eff, "]")
mcmc_hist_r_scale(samp[, , xmin])

Nothing special.

Rank plot visualisation of lp__, which has relative efficiency 0.33.

mcmc_hist_r_scale(samp[, , "lp__"])

Maybe small differences in scales, but it’s difficult to know whether small variation from uniform is relevant.

5.5.5 Alternative parameterization of half-Cauchy

writeLines(readLines("half_cauchy_alt.stan"))
parameters {
  vector<lower=0>[50] x_a;
  vector<lower=0>[50] x_b;
}

transformed parameters {
  vector[50] x = x_a .* sqrt(x_b);
}

model {
  x_a ~ normal(0, 1);
  x_b ~ inv_gamma(0.5, 0.5);
}

generated quantities {
  real I = fabs(x[1]) < 1 ? 1 : 0;
}

Run half-Cauchy with alternative parameterization

fit_half_reparam <- stan(
  file = 'half_cauchy_alt.stan', seed = 7878, refresh = 0
)

There are no warnings, and the sampling is as fast for the half-Cauchy nominal model.

samp <- as.array(fit_half_reparam)
res <- monitor_extra(samp[, , 101:150])
which_min_eff <- which.min(res$zsreff)
plot_rhat(res)

All Rhats are below 1.01. Classic split-Rhat’s look also good even if the classic is not well defined for half-Cauchy distribution.

plot_reff(res) 

Result: Rank normalized relative efficiencies have less variation than classic ones which is not well defined for Cauchy. Based on bulk-\(R_{\rm eff}\) and median-\(R_{\rm eff}\) the efficency for central quantities is much lower, but based on tail- \(R_{\rm eff}\) and MAD-\(R_{\rm eff}\) the efficency in tails is slightly better. We also see that parameterization which is good for Cauchy is not necessary good for half-Cauchy as constraint <lower=0> adds additional change in the parameterization.

We check lp__:

res <- monitor_extra(samp[, , 151:152])
cat('lp__: Bulk-R_eff = ', round(res['lp__', 'zsreff'],2), '\n')
lp__: Bulk-R_eff =  0.24 
cat('lp__: Tail-R_eff = ', round(res['lp__', 'zfsreff'],2), '\n')
lp__: Tail-R_eff =  0.58 

We examine the relative efficiency in different parts of the posterior by computing the relative efficiency of small interval probability estimates.

plot_local_reff(fit_half_reparam, par = 100 + which_min_eff, nalpha = 20)

We examine also the relative efficiency of different quantile estimates.

plot_quantile_reff(fit_half_reparam, par = 100 + which_min_eff, nalpha = 40)

The relative efficiency near zero is much worse than for the half-Cauchy with nominal parameterization and constraint <lower=0>.

Rank plot visualisation of x[1], which has the smallest relative efficiency 0.24 among x’s.

xmin <- paste0("x[", which_min_eff, "]")
mcmc_hist_r_scale(samp[, , xmin])

The rank plots seem to be different from uniform, which is expected with lower relative efficiency.

Rank plot visualisation of lp__, which has relative efficiency 0.24.

mcmc_hist_r_scale(samp[, , "lp__"])

The rank plot is different from uniform, which is expected with lower relative efficiency, but would require a reference.

5.6 Appendix F: Hierarchical model: Eight Schools

Eight Schools data is classic example for hierarchical models (Section 5.5, Gelman et al. (2013)), which despite the apparent simplicity nicely illustrates the typical problems in inference for hierarchical models. The Stan models below are from Michael Betancourt’s case study on Diagnosing Biased Inference with Divergences.

5.6.1 A Centered Eight Schools model

writeLines(readLines("eight_schools_cp.stan"))
data {
  int<lower=0> J;
  real y[J];
  real<lower=0> sigma[J];
}

parameters {
  real mu;
  real<lower=0> tau;
  real theta[J];
}

model {
  mu ~ normal(0, 5);
  tau ~ cauchy(0, 5);
  theta ~ normal(mu, tau);
  y ~ normal(theta, sigma);
}

5.6.1.1 Centered parameterization default MCMC options

Run centered parameterization model with default MCMC options.

input_data <- read_rdump("eight_schools.data.R")
fit_cp <- stan(
  file = 'eight_schools_cp.stan', data = input_data,
  iter = 2000, chains = 4, seed = 483892929, refresh = 0
)
Warning: There were 147 divergent transitions after warmup. Increasing adapt_delta above 0.8 may help. See
http://mc-stan.org/misc/warnings.html#divergent-transitions-after-warmup
Warning: There were 1 chains where the estimated Bayesian Fraction of Missing Information was low. See
http://mc-stan.org/misc/warnings.html#bfmi-low
Warning: Examine the pairs() plot to diagnose sampling problems

We do observe divergent transitions, which is sensitive diagnostic for this model and we observer also low BFMI, so we already know that the results can’t be trusted. We can use Rhat and Reff diagnostics also for other MCMC algorithms, but they can also be helpful to recognize problematic parts of the posterior.

sel <- c(
  'sRhat', 'zsRhat', 'zfsRhat', 'reff',
  'zsreff', 'zfsreff', 'medsreff', 'madsreff'
)
res <- monitor_extra(fit_cp)
round(res[, sel], 2)
Inference for the input samples ( chains: each with iter = ; warmup = ):

         sRhat zsRhat zfsRhat reff zsreff zfsreff medsreff madsreff
mu        1.00   1.00    1.01 0.15   0.15    0.05     0.12     0.09
tau       1.01   1.03    1.00 0.06   0.03    0.21     0.08     0.11
theta[1]  1.00   1.00    1.01 0.22   0.21    0.05     0.11     0.09
theta[2]  1.00   1.00    1.02 0.22   0.21    0.04     0.11     0.10
theta[3]  1.00   1.00    1.01 0.30   0.28    0.06     0.13     0.11
theta[4]  1.00   1.00    1.02 0.24   0.22    0.05     0.12     0.09
theta[5]  1.00   1.00    1.01 0.27   0.25    0.11     0.14     0.11
theta[6]  1.00   1.00    1.02 0.29   0.27    0.05     0.14     0.08
theta[7]  1.00   1.00    1.01 0.18   0.17    0.12     0.09     0.11
theta[8]  1.00   1.00    1.01 0.27   0.24    0.08     0.15     0.10
lp__      1.04   1.04    1.02 0.02   0.03    0.05     0.07     0.12

For each parameter, Bulk_Reff and Tail_Reff are crude measures of relative
effective sample size for bulk and tail quantities respectively (good mixing
Reff > 0.1), and Rhat is the potential scale reduction factor on rank normalized
split chains (at convergence, Rhat = 1).

Some rank-normalized split-Rhats are larger than 1.01. Bulk-\(R_{\rm eff}\) for tau and lp__ are less than 10%, which is worryingly low and longer chains should be run.

We examine the relative efficiency in different parts of the posterior by computing the relative efficiency of small interval probability estimates for tau.

plot_local_reff(fit = fit_cp, par = "tau", nalpha = 20)

We see that MCMC has difficulties in exploring small tau values. As the efficiency for estimating small tau values is practically zero, we may assume that we may miss substantial amount of posterior mass and get biased estimates. Red ticks which show chains with divergences have concentrated to small tau values, indicating problems exploring even smaller values which is likely to cause bias.

We examine also the relative efficiency of different quantile estimates.

plot_quantile_reff(fit = fit_cp, par = "tau", nalpha = 40)

Most of the quantile estimates have worryingly low relative efficiency.

Rank plot visualisation of tau, which has the smallest relative efficiency 0.05 among parameters.

samp_cp <- as.array(fit_cp)
mcmc_hist_r_scale(samp_cp[, , "tau"])

We observe clear problems.

Rank plot visualisation of lp__, which has relative efficiency 0.05.

mcmc_hist_r_scale(samp_cp[, , "lp__"])

We observe clear problems.

As we observed clear problems we shouldn’t trust any Monte Carlo error estimates, but for illustration we show MCSE, interval and corresponding S_eff for median of mu and tau.

round(quantile_mcse(samp_cp[, , "mu"], prob = 0.5), 2)
  mcse  q05  q95   Seff
1 0.08 3.95 4.31 472.33
round(quantile_mcse(samp_cp[, , "tau"], prob = 0.5), 2)
  mcse q05  q95   Seff
1 0.19 2.7 3.32 306.86

The proposed default output from Stan would look as follows, with some Rhat>1.01 and Reff<0.1 indicating potential convergence problem.

monitor_extra(fit_cp)
Inference for the input samples (4 chains: each with iter = 2000; warmup = 1000):

           mean se_mean   sd     Q5    Q50   Q95 neff reff sneff zneff zsneff zsreff Rhat sRhat
mu         4.15    0.13 3.25  -1.06   4.03  9.69  618 0.15   624   597    602   0.15 1.00  1.00
tau        3.80    0.18 2.95   0.81   2.98  9.58  259 0.06   308    87    114   0.03 1.01  1.01
theta[1]   6.08    0.19 5.66  -1.66   5.25 16.38  886 0.22   895   827    832   0.21 1.00  1.00
theta[2]   4.59    0.16 4.67  -2.71   3.93 13.04  885 0.22   908   828    843   0.21 1.00  1.00
theta[3]   3.76    0.15 5.20  -5.10   3.92 12.09 1217 0.30  1222  1101   1109   0.28 1.00  1.00
theta[4]   4.49    0.16 4.85  -3.23   4.09 12.87  956 0.24   952   874    874   0.22 1.00  1.00
theta[5]   3.26    0.14 4.62  -4.42   3.50 10.76 1078 0.27  1093   970    982   0.25 1.00  1.00
theta[6]   3.76    0.14 4.71  -4.00   3.73 11.51 1163 0.29  1175  1059   1080   0.27 1.00  1.00
theta[7]   6.03    0.19 4.97  -1.08   5.06 14.93  716 0.18   726   659    666   0.17 1.00  1.00
theta[8]   4.58    0.17 5.40  -3.86   4.34 13.67 1072 0.27  1097   950    968   0.24 1.00  1.00
lp__     -15.11    0.62 5.87 -24.31 -15.32 -5.29   91 0.02   122    85    115   0.03 1.03  1.04
         zRhat zsRhat zfsRhat zfsneff zfsreff medsneff medsreff madsneff madsreff
mu        1.00   1.00    1.01     193    0.05      472     0.12      353     0.09
tau       1.03   1.03    1.00     826    0.21      307     0.08      442     0.11
theta[1]  1.00   1.00    1.01     203    0.05      455     0.11      375     0.09
theta[2]  1.00   1.00    1.02     163    0.04      433     0.11      388     0.10
theta[3]  1.00   1.00    1.01     247    0.06      534     0.13      437     0.11
theta[4]  1.00   1.00    1.02     205    0.05      460     0.12      358     0.09
theta[5]  1.00   1.00    1.01     420    0.11      569     0.14      443     0.11
theta[6]  1.00   1.00    1.02     197    0.05      548     0.14      317     0.08
theta[7]  1.00   1.00    1.01     475    0.12      354     0.09      446     0.11
theta[8]  1.00   1.00    1.01     332    0.08      589     0.15      389     0.10
lp__      1.03   1.04    1.02     217    0.05      290     0.07      471     0.12

5.6.1.2 Centered parameterization with longer chains

Low R_eff can be sometimes compensated with longer chains. Let’s check 10 times longer chain.

fit_cp2 <- stan(
  file = 'eight_schools_cp.stan', data = input_data,
  iter = 20000, chains = 4, seed =  483892929, refresh =  0
)
Warning: There were 1346 divergent transitions after warmup. Increasing adapt_delta above 0.8 may help. See
http://mc-stan.org/misc/warnings.html#divergent-transitions-after-warmup
Warning: Examine the pairs() plot to diagnose sampling problems
res <- monitor_extra(fit_cp2)
round(res[, sel], 2)
Inference for the input samples ( chains: each with iter = ; warmup = ):

         sRhat zsRhat zfsRhat reff zsreff zfsreff medsreff madsreff
mu        1.00   1.00       1 0.03   0.03    0.08     0.06     0.09
tau       1.00   1.01       1 0.04   0.01    0.16     0.05     0.07
theta[1]  1.00   1.00       1 0.07   0.05    0.09     0.06     0.07
theta[2]  1.00   1.00       1 0.06   0.05    0.12     0.06     0.07
theta[3]  1.00   1.00       1 0.13   0.10    0.17     0.06     0.08
theta[4]  1.00   1.00       1 0.08   0.07    0.15     0.05     0.08
theta[5]  1.00   1.00       1 0.13   0.11    0.17     0.06     0.08
theta[6]  1.00   1.00       1 0.09   0.08    0.15     0.06     0.08
theta[7]  1.00   1.00       1 0.05   0.04    0.08     0.07     0.07
theta[8]  1.00   1.00       1 0.09   0.07    0.14     0.06     0.07
lp__      1.01   1.01       1 0.02   0.02    0.03     0.05     0.07

For each parameter, Bulk_Reff and Tail_Reff are crude measures of relative
effective sample size for bulk and tail quantities respectively (good mixing
Reff > 0.1), and Rhat is the potential scale reduction factor on rank normalized
split chains (at convergence, Rhat = 1).

Some rank-normalized split-Rhats are still larger than 1.01. Bulk-\(R_{\rm eff}\) for tau and lp__ are around 1%. A drop in the relative efficiency when increasing the number of iterations indicates serious problems in mixing.

We examine the relative efficiency in different parts of the posterior by computing the relative efficiency of small interval probability estimates for tau.

plot_local_reff(fit = fit_cp2, par = "tau", nalpha = 50)

We see that MCMC has difficulties in exploring small tau values. As the efficiency for estimating small tau values is practically zero, we may assume that we may miss substantial amount of posterior mass and get biased estimates.

We examine also the relative efficiency of different quantile estimates.

plot_quantile_reff(fit = fit_cp2, par = "tau", nalpha = 100)

Most of the quantile estimates have worryingly low relative efficiency and many practically zero efficiency indicating clear failure in mixing.

We examine also small probability interval efficiency for mu in the following plot.

plot_local_reff(fit = fit_cp2, par = "mu", nalpha = 50)

There are gaps of poor efficiency which indicates sticking of MCMC, but the sticking doesn’t occur for any specific range of values of mu as for tau.

We examine also the relative efficiency of different quantiles of mu

plot_quantile_reff(fit = fit_cp2, par = "mu", nalpha = 100)

We see clear failure in mixing.

Rank plot visualisation of tau, which has the smallest relative efficiency 0.01 among parameters.

samp_cp2 <- as.array(fit_cp2)
mcmc_hist_r_scale(samp_cp2[, , "tau"])

We observe clear problems to sample small values of tau and sticking of the 4th chain.

Rank plot visualisation of lp__, which has relative efficiency 0.01.

mcmc_hist_r_scale(samp_cp2[, , "lp__"])

We observe clear problems sampling different energy levels, which are connected to values of tau.

As we observed clear problems we shouldn’t trust any Monte Carlo error estimates, but for illustration we show MCSE, interval and corresponding S_eff for median of mu and tau. Comparing to the shorter MCMC run, 10 times more draws has not reduced the MCSE to one third as would be expected without problems in in mixing.

round(quantile_mcse(samp_cp2[ , , "mu"], prob = 0.5),2)
  mcse  q05  q95    Seff
1 0.09 4.17 4.47 2235.27
round(quantile_mcse(samp_cp2[ , , "tau"], prob = 0.5),2)
  mcse  q05  q95    Seff
1 0.08 2.98 3.24 1904.11

The proposed default output from Stan would look as follows, with some Rhat>1.01 and Reff<=0.01 indicating convergence problem.

monitor_extra(fit_cp2)
Inference for the input samples (4 chains: each with iter = 20000; warmup = 10000):

           mean se_mean   sd     Q5    Q50   Q95 neff reff sneff zneff zsneff zsreff Rhat sRhat
mu         4.24    0.09 3.36  -1.28   4.32  9.71 1343 0.03  1331  1383   1367   0.03    1  1.00
tau        3.96    0.08 3.18   0.74   3.10 10.21 1724 0.04  1725   497    491   0.01    1  1.00
theta[1]   6.23    0.11 5.83  -1.80   5.76 16.53 2801 0.07  2720  1927   1875   0.05    1  1.00
theta[2]   4.83    0.10 4.88  -2.73   4.80 12.89 2557 0.06  2572  2205   2193   0.05    1  1.00
theta[3]   3.71    0.08 5.43  -5.25   3.99 12.01 5069 0.13  5127  3777   3814   0.10    1  1.00
theta[4]   4.67    0.09 4.98  -3.15   4.70 12.81 3332 0.08  3313  2750   2731   0.07    1  1.00
theta[5]   3.38    0.07 4.83  -4.85   3.61 10.74 5210 0.13  5236  4440   4437   0.11    1  1.00
theta[6]   3.81    0.08 5.00  -4.44   4.03 11.65 3622 0.09  3557  3097   3033   0.08    1  1.00
theta[7]   6.36    0.12 5.26  -1.23   5.97 15.80 1947 0.05  1957  1572   1553   0.04    1  1.00
theta[8]   4.78    0.09 5.48  -3.51   4.71 13.60 3798 0.09  3843  2913   2915   0.07    1  1.00
lp__     -15.32    0.24 6.12 -25.03 -15.56 -4.77  671 0.02   668   646    643   0.02    1  1.01
         zRhat zsRhat zfsRhat zfsneff zfsreff medsneff medsreff madsneff madsreff
mu           1   1.00       1    3320    0.08     2235     0.06     3553     0.09
tau          1   1.01       1    6224    0.16     1904     0.05     2991     0.07
theta[1]     1   1.00       1    3529    0.09     2479     0.06     2736     0.07
theta[2]     1   1.00       1    4625    0.12     2469     0.06     2786     0.07
theta[3]     1   1.00       1    6891    0.17     2441     0.06     3316     0.08
theta[4]     1   1.00       1    5857    0.15     2186     0.05     3025     0.08
theta[5]     1   1.00       1    6998    0.17     2403     0.06     3210     0.08
theta[6]     1   1.00       1    6198    0.15     2304     0.06     3208     0.08
theta[7]     1   1.00       1    3094    0.08     2643     0.07     2697     0.07
theta[8]     1   1.00       1    5487    0.14     2231     0.06     2915     0.07
lp__         1   1.01       1    1002    0.03     1981     0.05     2724     0.07

5.6.1.3 Centered parameterization very long chains

And for further evidence let’s check 100 times longer chains than the default.

fit_cp3 <- stan(
  file='eight_schools_cp.stan', data=input_data,
  iter=200000, chains=4, seed=483892929, refresh=0
)
Warning: There were 15980 divergent transitions after warmup. Increasing adapt_delta above 0.8 may help. See
http://mc-stan.org/misc/warnings.html#divergent-transitions-after-warmup
Warning: Examine the pairs() plot to diagnose sampling problems
res <- monitor_extra(fit_cp3)
round(res[, sel], 2)
Inference for the input samples ( chains: each with iter = ; warmup = ):

         sRhat zsRhat zfsRhat reff zsreff zfsreff medsreff madsreff
mu           1      1       1 0.03   0.03    0.04     0.02     0.02
tau          1      1       1 0.02   0.00    0.05     0.02     0.02
theta[1]     1      1       1 0.05   0.04    0.04     0.02     0.02
theta[2]     1      1       1 0.06   0.05    0.03     0.02     0.03
theta[3]     1      1       1 0.08   0.06    0.04     0.02     0.02
theta[4]     1      1       1 0.07   0.05    0.04     0.02     0.02
theta[5]     1      1       1 0.06   0.05    0.02     0.02     0.02
theta[6]     1      1       1 0.08   0.07    0.03     0.02     0.02
theta[7]     1      1       1 0.04   0.04    0.03     0.02     0.02
theta[8]     1      1       1 0.07   0.05    0.05     0.02     0.03
lp__         1      1       1 0.01   0.00    0.01     0.02     0.02

For each parameter, Bulk_Reff and Tail_Reff are crude measures of relative
effective sample size for bulk and tail quantities respectively (good mixing
Reff > 0.1), and Rhat is the potential scale reduction factor on rank normalized
split chains (at convergence, Rhat = 1).

Some rank-normalized split-Rhats are still larger than 1.01. Bulk-\(R_{\rm eff}\) for tau and lp__ are around 0% and relative efficiencies for other parameters are also very small. Drop in the relative efficiency when the the sample size is increased indicates serious problems in mixing.

We examine the relative efficiency in different parts of the posterior by computing the relative efficiency of small interval probability estimates for tau.

plot_local_reff(fit = fit_cp3, par = "tau", nalpha = 100)

We see that MCMC has difficulties in exploring small tau values. As the efficiency for estimating small tau values is practically zero, we may assume that we may miss substantial amount of posterior mass and get biased estimates. It is good to note that the low efficiency for small tau values is due to too large step size, and the low efficiency for large tau values is partially due to too small step size, but even more importantly due to the random walk in energy space.

We examine also the relative efficiency of different quantile estimates.

plot_quantile_reff(fit = fit_cp3, par = "tau", nalpha = 100)

Most of the quantile estimates have worryingly low relative efficiency and many practically zero efficiency indicating clear failure in mixing.

Rank plot visualisation of tau, which has the smallest relative efficiency 0.00 among parameters.

samp_cp3 <- as.array(fit_cp3)
mcmc_hist_r_scale(samp_cp3[, , 2])

We observe clear problems to sample small values of tau. Even with 100000 draws per chain, the plots don’t get crowded as traceplots would, and sticking of chains is easy to see.

Rank plot visualisation of lp__, which has relative efficiency 0.01.

mcmc_hist(r_scale(samp_cp3[,,11]), breaks = seq(0,prod(dim(samp_cp3)[1:2]),by=prod(dim(samp_cp3)[1:2])/100)+0.5)

We observe clear problems sampling different energy levels and sticking in all chains.

As we observed clear problems we shouldn’t trust any Monte Carlo error estimates, but for illustration we show MCSE, interval and corresponding S_eff for median of mu and tau. Comparing to the shorter MCMC run, 100 times more draws has not reduced the MCSE to one tenths as would be expected without problems in in mixing.

round(quantile_mcse(samp_cp3[ , , "mu"], prob = 0.5),2)
  mcse  q05  q95    Seff
1 0.05 4.23 4.39 7862.67
round(quantile_mcse(samp_cp3[ , , "tau"], prob = 0.5),2)
  mcse q05  q95    Seff
1 0.04 3.1 3.22 7835.81

The proposed default output from Stan would look as follows, with some Rhat>1.01 and Reff<0.01 indicating convergence problem.

monitor_extra(fit_cp3)
Inference for the input samples (4 chains: each with iter = 2e+05; warmup = 1e+05):

           mean se_mean   sd     Q5    Q50   Q95  neff reff sneff zneff zsneff zsreff Rhat sRhat
mu         4.31    0.03 3.31  -1.17   4.31  9.69 11890 0.03 11864 11802  11778   0.03    1     1
tau        4.02    0.04 3.16   0.82   3.16 10.14  7167 0.02  7162  1471   1461   0.00    1     1
theta[1]   6.33    0.04 5.77  -1.68   5.73 16.64 20846 0.05 20751 15526  15424   0.04    1     1
theta[2]   4.91    0.03 4.82  -2.72   4.80 12.89 25847 0.06 25794 21624  21615   0.05    1     1
theta[3]   3.77    0.03 5.42  -5.28   4.03 11.94 33590 0.08 33718 25413  25426   0.06    1     1
theta[4]   4.72    0.03 4.94  -3.14   4.64 12.77 26755 0.07 26876 21801  21969   0.05    1     1
theta[5]   3.43    0.03 4.78  -4.74   3.66 10.73 23200 0.06 23063 19433  19304   0.05    1     1
theta[6]   3.91    0.03 4.97  -4.43   4.08 11.61 32724 0.08 32710 26293  26270   0.07    1     1
theta[7]   6.45    0.04 5.24  -1.06   5.98 15.87 17430 0.04 17392 14358  14319   0.04    1     1
theta[8]   4.81    0.03 5.48  -3.70   4.71 13.71 28891 0.07 28967 21469  21475   0.05    1     1
lp__     -15.55    0.13 5.90 -25.08 -15.69 -5.58  2201 0.01  2186  1835   1823   0.00    1     1
         zRhat zsRhat zfsRhat zfsneff zfsreff medsneff medsreff madsneff madsreff
mu           1      1       1   16123    0.04     7863     0.02     9054     0.02
tau          1      1       1   18719    0.05     7836     0.02     7683     0.02
theta[1]     1      1       1   14193    0.04     8090     0.02     9446     0.02
theta[2]     1      1       1   12923    0.03     8100     0.02    11211     0.03
theta[3]     1      1       1   17915    0.04     8605     0.02     9693     0.02
theta[4]     1      1       1   16036    0.04     7352     0.02     9854     0.02
theta[5]     1      1       1    9446    0.02     9113     0.02     9196     0.02
theta[6]     1      1       1   11065    0.03     9289     0.02     9751     0.02
theta[7]     1      1       1   12039    0.03     8087     0.02     9759     0.02
theta[8]     1      1       1   21912    0.05     7117     0.02    10790     0.03
lp__         1      1       1    2610    0.01     7811     0.02     8068     0.02

5.6.2 Non-centered Eight Schools model

For hierarchical models, non-centered parameterization often works better

writeLines(readLines("eight_schools_ncp.stan"))
data {
  int<lower=0> J;
  real y[J];
  real<lower=0> sigma[J];
}

parameters {
  real mu;
  real<lower=0> tau;
  real theta_tilde[J];
}

transformed parameters {
  real theta[J];
  for (j in 1:J)
    theta[j] = mu + tau * theta_tilde[j];
}

model {
  mu ~ normal(0, 5);
  tau ~ cauchy(0, 5);
  theta_tilde ~ normal(0, 1);
  y ~ normal(theta, sigma);
}

5.6.2.1 Non-centered parameterization default MCMC options

Run non-centered parameterization model with default options.

input_data <- read_rdump("eight_schools.data.R")
fit_ncp <- stan(
  file='eight_schools_ncp.stan', data=input_data,
  iter=2000, chains=4, seed=483892929, refresh=0
)
Warning: There were 2 divergent transitions after warmup. Increasing adapt_delta above 0.8 may help. See
http://mc-stan.org/misc/warnings.html#divergent-transitions-after-warmup
Warning: Examine the pairs() plot to diagnose sampling problems

We still observe some divergent transitions with the default adapt_delta=0.8. Let’s analyze the sample.

res <- monitor_extra(fit_ncp)
round(res[, sel], 2)
Inference for the input samples ( chains: each with iter = ; warmup = ):

               sRhat zsRhat zfsRhat reff zsreff zfsreff medsreff madsreff
mu                 1      1       1 1.01   1.02    0.44     1.06     0.59
tau                1      1       1 0.67   0.58    0.78     0.79     0.79
theta_tilde[1]     1      1       1 1.13   1.14    0.54     1.12     0.62
theta_tilde[2]     1      1       1 1.44   1.44    0.50     1.34     0.60
theta_tilde[3]     1      1       1 1.23   1.24    0.53     1.35     0.61
theta_tilde[4]     1      1       1 1.34   1.36    0.50     1.21     0.60
theta_tilde[5]     1      1       1 1.06   1.07    0.57     1.07     0.67
theta_tilde[6]     1      1       1 1.30   1.30    0.52     1.19     0.60
theta_tilde[7]     1      1       1 0.97   0.97    0.57     0.96     0.70
theta_tilde[8]     1      1       1 1.21   1.21    0.48     1.09     0.50
theta[1]           1      1       1 0.88   0.95    0.63     1.09     0.70
theta[2]           1      1       1 1.22   1.25    0.58     1.06     0.64
theta[3]           1      1       1 0.95   1.00    0.64     1.00     0.74
theta[4]           1      1       1 1.14   1.17    0.66     1.14     0.71
theta[5]           1      1       1 1.03   1.08    0.70     1.13     0.67
theta[6]           1      1       1 1.18   1.24    0.67     1.19     0.79
theta[7]           1      1       1 1.10   1.15    0.62     1.08     0.73
theta[8]           1      1       1 1.01   1.12    0.62     1.10     0.66
lp__               1      1       1 0.42   0.43    0.62     0.50     0.70

For each parameter, Bulk_Reff and Tail_Reff are crude measures of relative
effective sample size for bulk and tail quantities respectively (good mixing
Reff > 0.1), and Rhat is the potential scale reduction factor on rank normalized
split chains (at convergence, Rhat = 1).

All Rhats are close to 1, and relative efficiencies are good.

We examine the relative efficiency in different parts of the posterior by computing the relative efficiency of small interval probability estimates for tau.

plot_local_reff(fit = fit_ncp, par = "tau", nalpha = 20)

Small tau values are still more difficult to explore, but the relative efficiency is in a good range.

We examine also the relative efficiency of different quantile estimates.

plot_quantile_reff(fit = fit_ncp, par = "tau", nalpha = 40)

Small quantile estimates have lower but still useful efficiencies.

Rank plot visualisation of tau, which has the smallest relative efficiency 0.57 among parameters.

samp_ncp <- as.array(fit_ncp)
mcmc_hist_r_scale(samp_ncp[, , 2])

We can see problems in sampling small values of tau.

Rank plot visualisation of lp__, which has relative efficiency 0.01.

mcmc_hist_r_scale(samp_ncp[, , 19])

We observe clear problems sampling different energy levels.

As we observed clear problems we shouldn’t trust any Monte Carlo error estimates, but for illustration we show MCSE, interval and corresponding S_eff for median of mu and tau. MCSE is much smaller than with the problematic centered parameterization.

round(quantile_mcse(samp_ncp[ , , "mu"], prob = 0.5),2)
  mcse  q05  q95    Seff
1 0.09 4.29 4.55 4236.54
round(quantile_mcse(samp_ncp[ , , "tau"], prob = 0.5),2)
  mcse  q05  q95    Seff
1 0.06 2.67 2.86 3150.55

The proposed default output from Stan would look as follows, with all Rhat<1.01 and Reff>0.1.

monitor_extra(fit_ncp, warmup=0)
Inference for the input samples (4 chains: each with iter = 2000; warmup = 1000):

                mean se_mean   sd     Q5   Q50   Q95 neff reff sneff zneff zsneff zsreff Rhat sRhat
mu              4.38    0.05 3.24  -0.98  4.41  9.52 4023 1.01  4036  4070   4083   1.02    1     1
tau             3.61    0.06 3.16   0.25  2.77  9.77 2684 0.67  2700  2293   2303   0.58    1     1
theta_tilde[1]  0.32    0.01 0.97  -1.31  0.35  1.88 4528 1.13  4566  4532   4571   1.14    1     1
theta_tilde[2]  0.12    0.01 0.92  -1.41  0.14  1.64 5742 1.44  5758  5754   5771   1.44    1     1
theta_tilde[3] -0.09    0.01 0.96  -1.62 -0.10  1.49 4929 1.23  4974  4919   4966   1.24    1     1
theta_tilde[4]  0.05    0.01 0.91  -1.43  0.03  1.51 5370 1.34  5436  5376   5442   1.36    1     1
theta_tilde[5] -0.16    0.01 0.91  -1.67 -0.17  1.35 4222 1.06  4269  4226   4273   1.07    1     1
theta_tilde[6] -0.07    0.01 0.95  -1.64 -0.08  1.48 5182 1.30  5195  5179   5192   1.30    1     1
theta_tilde[7]  0.36    0.02 0.97  -1.25  0.39  1.88 3885 0.97  3888  3893   3898   0.97    1     1
theta_tilde[8]  0.08    0.01 0.97  -1.51  0.07  1.68 4838 1.21  4853  4834   4848   1.21    1     1
theta[1]        6.27    0.09 5.60  -1.38  5.68 15.84 3504 0.88  3530  3767   3790   0.95    1     1
theta[2]        5.03    0.07 4.62  -2.29  4.88 12.83 4892 1.22  4928  4965   5002   1.25    1     1
theta[3]        3.95    0.09 5.24  -4.28  4.08 11.90 3796 0.95  3825  3971   4001   1.00    1     1
theta[4]        4.64    0.07 4.63  -2.74  4.66 12.09 4554 1.14  4577  4676   4699   1.17    1     1
theta[5]        3.63    0.07 4.54  -4.13  3.89 10.45 4126 1.03  4174  4258   4310   1.08    1     1
theta[6]        3.95    0.07 4.88  -4.11  4.19 11.30 4726 1.18  4815  4922   4965   1.24    1     1
theta[7]        6.28    0.07 4.94  -0.84  5.86 15.18 4416 1.10  4423  4524   4599   1.15    1     1
theta[8]        4.91    0.08 5.37  -3.24  4.77 13.52 4046 1.01  4066  4439   4461   1.12    1     1
lp__           -6.81    0.06 2.30 -11.06 -6.47 -3.68 1678 0.42  1684  1704   1711   0.43    1     1
               zRhat zsRhat zfsRhat zfsneff zfsreff medsneff medsreff madsneff madsreff
mu                 1      1       1    1775    0.44     4237     1.06     2343     0.59
tau                1      1       1    3132    0.78     3151     0.79     3171     0.79
theta_tilde[1]     1      1       1    2141    0.54     4489     1.12     2491     0.62
theta_tilde[2]     1      1       1    1984    0.50     5350     1.34     2406     0.60
theta_tilde[3]     1      1       1    2136    0.53     5397     1.35     2447     0.61
theta_tilde[4]     1      1       1    2008    0.50     4821     1.21     2419     0.60
theta_tilde[5]     1      1       1    2263    0.57     4282     1.07     2692     0.67
theta_tilde[6]     1      1       1    2078    0.52     4760     1.19     2412     0.60
theta_tilde[7]     1      1       1    2260    0.57     3823     0.96     2783     0.70
theta_tilde[8]     1      1       1    1905    0.48     4351     1.09     2000     0.50
theta[1]           1      1       1    2528    0.63     4358     1.09     2788     0.70
theta[2]           1      1       1    2302    0.58     4229     1.06     2542     0.64
theta[3]           1      1       1    2553    0.64     4002     1.00     2971     0.74
theta[4]           1      1       1    2654    0.66     4548     1.14     2859     0.71
theta[5]           1      1       1    2783    0.70     4523     1.13     2689     0.67
theta[6]           1      1       1    2666    0.67     4759     1.19     3152     0.79
theta[7]           1      1       1    2493    0.62     4315     1.08     2911     0.73
theta[8]           1      1       1    2472    0.62     4400     1.10     2644     0.66
lp__               1      1       1    2487    0.62     1990     0.50     2796     0.70

5.6.2.2 Non-centered parameterization default MCMC options plus adapt_delta=0.95

Next we examine the same model but with higher adapt_delta=0.95.

fit_ncp2 <- stan(
  file='eight_schools_ncp.stan', data=input_data,
  iter=2000, chains=4, control=list(adapt_delta = 0.95), 
  seed=483892929, refresh=0
)

We get zero divergences with adapt_delta = 0.95.

res <- monitor_extra(fit_ncp2)
round(res[, sel], 2)
Inference for the input samples ( chains: each with iter = ; warmup = ):

               sRhat zsRhat zfsRhat reff zsreff zfsreff medsreff madsreff
mu                 1      1       1 1.38   1.38    0.52     1.34     0.65
tau                1      1       1 0.87   0.72    0.84     0.84     0.79
theta_tilde[1]     1      1       1 1.23   1.26    0.46     1.30     0.51
theta_tilde[2]     1      1       1 1.03   1.04    0.50     1.00     0.58
theta_tilde[3]     1      1       1 1.60   1.62    0.52     1.41     0.59
theta_tilde[4]     1      1       1 1.51   1.52    0.47     1.37     0.58
theta_tilde[5]     1      1       1 1.39   1.40    0.54     1.26     0.58
theta_tilde[6]     1      1       1 1.21   1.21    0.47     1.20     0.59
theta_tilde[7]     1      1       1 1.18   1.20    0.53     1.19     0.60
theta_tilde[8]     1      1       1 1.52   1.54    0.49     1.33     0.56
theta[1]           1      1       1 1.10   1.23    0.64     1.09     0.71
theta[2]           1      1       1 1.22   1.28    0.63     1.23     0.61
theta[3]           1      1       1 1.29   1.36    0.70     1.32     0.68
theta[4]           1      1       1 1.12   1.17    0.64     1.13     0.68
theta[5]           1      1       1 1.30   1.34    0.68     1.32     0.74
theta[6]           1      1       1 1.35   1.42    0.60     1.39     0.68
theta[7]           1      1       1 1.11   1.18    0.66     1.15     0.69
theta[8]           1      1       1 1.18   1.23    0.64     1.19     0.69
lp__               1      1       1 0.40   0.41    0.65     0.53     0.77

For each parameter, Bulk_Reff and Tail_Reff are crude measures of relative
effective sample size for bulk and tail quantities respectively (good mixing
Reff > 0.1), and Rhat is the potential scale reduction factor on rank normalized
split chains (at convergence, Rhat = 1).

All Rhats are close to 1, and relative efficiencies are good and slightly better than with the default adapt_delta.

We examine the relative efficiency in different parts of the posterior by computing the relative efficiency of small interval probability estimates for tau.

plot_local_reff(fit = fit_ncp2, par = "mu", nalpha = 20)

Small tau values are still more difficult to explore, but the relative efficiency is in a good range.

Check with a finer resolution

plot_local_reff(fit = fit_ncp2, par = "mu", nalpha = 50)

Also with a finer resolution the efficiency of MCMC is good for different values of tau.

We examine also the relative efficiency of different quantile estimates.

plot_quantile_reff(fit = fit_ncp2, par = "mu", nalpha = 40)

Rank plot visualisation of tau, which has the smallest relative efficiency 0.62 among parameters.

samp_ncp2 <- as.array(fit_ncp2)
mcmc_hist_r_scale(samp_ncp2[, , 2])

Higher adapt_delta value seems to have improved sampling of small values.

Rank plot visualisation of lp__, which has relative efficiency 0.44.

mcmc_hist_r_scale(samp_ncp2[, , 19])

Seems ok.

Now we should be able to trust Monte Carlo error estimates, and for illustration we show MCSE, interval and corresponding S_eff for median of mu and tau. The change compared to adapt_delta=0.8 is small.

round(quantile_mcse(samp_ncp2[ , , "mu"], prob = 0.5),2)
  mcse  q05 q95    Seff
1 0.05 4.32 4.5 5342.84
round(quantile_mcse(samp_ncp2[ , , "tau"], prob = 0.5),2)
  mcse  q05 q95    Seff
1 0.05 2.72 2.9 3358.93

The proposed default output from Stan would look as follows, with all Rhat<1.01 and Reff>0.1.

monitor_extra(fit_ncp2)
Inference for the input samples (4 chains: each with iter = 2000; warmup = 1000):

                mean se_mean   sd     Q5   Q50   Q95 neff reff sneff zneff zsneff zsreff Rhat sRhat
mu              4.47    0.05 3.37  -1.14  4.42  9.96 5512 1.38  5544  5499   5531   1.38    1     1
tau             3.59    0.05 3.17   0.30  2.81  9.50 3491 0.87  3535  2846   2872   0.72    1     1
theta_tilde[1]  0.31    0.01 0.98  -1.29  0.31  1.88 4906 1.23  5043  4906   5046   1.26    1     1
theta_tilde[2]  0.10    0.01 0.94  -1.44  0.11  1.62 4101 1.03  4177  4099   4177   1.04    1     1
theta_tilde[3] -0.08    0.01 0.97  -1.64 -0.08  1.48 6406 1.60  6501  6390   6485   1.62    1     1
theta_tilde[4]  0.06    0.01 0.95  -1.51  0.08  1.62 6046 1.51  6104  6020   6076   1.52    1     1
theta_tilde[5] -0.16    0.01 0.94  -1.71 -0.17  1.39 5543 1.39  5608  5544   5608   1.40    1     1
theta_tilde[6] -0.06    0.01 0.97  -1.64 -0.06  1.53 4840 1.21  4864  4831   4855   1.21    1     1
theta_tilde[7]  0.37    0.01 0.96  -1.25  0.40  1.87 4720 1.18  4741  4773   4796   1.20    1     1
theta_tilde[8]  0.06    0.01 0.96  -1.54  0.07  1.66 6083 1.52  6140  6076   6142   1.54    1     1
theta[1]        6.25    0.08 5.58  -1.62  5.64 16.31 4394 1.10  4480  4792   4907   1.23    1     1
theta[2]        5.01    0.07 4.68  -2.42  4.82 12.96 4894 1.22  4951  5061   5122   1.28    1     1
theta[3]        4.09    0.07 5.27  -4.63  4.26 12.10 5156 1.29  5228  5382   5457   1.36    1     1
theta[4]        4.82    0.07 4.78  -2.73  4.75 12.48 4485 1.12  4511  4658   4695   1.17    1     1
theta[5]        3.72    0.06 4.62  -3.96  3.82 11.02 5216 1.30  5223  5333   5346   1.34    1     1
theta[6]        4.13    0.07 4.95  -3.88  4.27 11.44 5392 1.35  5430  5631   5670   1.42    1     1
theta[7]        6.38    0.08 5.10  -0.98  5.88 15.43 4439 1.11  4601  4643   4708   1.18    1     1
theta[8]        4.87    0.08 5.17  -3.23  4.78 13.13 4730 1.18  4727  4927   4924   1.23    1     1
lp__           -6.92    0.06 2.31 -11.19 -6.60 -3.78 1618 0.40  1629  1633   1641   0.41    1     1
               zRhat zsRhat zfsRhat zfsneff zfsreff medsneff medsreff madsneff madsreff
mu                 1      1       1    2087    0.52     5343     1.34     2585     0.65
tau                1      1       1    3360    0.84     3359     0.84     3163     0.79
theta_tilde[1]     1      1       1    1837    0.46     5185     1.30     2051     0.51
theta_tilde[2]     1      1       1    1988    0.50     3986     1.00     2305     0.58
theta_tilde[3]     1      1       1    2086    0.52     5655     1.41     2341     0.59
theta_tilde[4]     1      1       1    1870    0.47     5462     1.37     2317     0.58
theta_tilde[5]     1      1       1    2151    0.54     5051     1.26     2303     0.58
theta_tilde[6]     1      1       1    1899    0.47     4801     1.20     2346     0.59
theta_tilde[7]     1      1       1    2127    0.53     4741     1.19     2383     0.60
theta_tilde[8]     1      1       1    1972    0.49     5323     1.33     2234     0.56
theta[1]           1      1       1    2579    0.64     4375     1.09     2848     0.71
theta[2]           1      1       1    2530    0.63     4930     1.23     2451     0.61
theta[3]           1      1       1    2803    0.70     5299     1.32     2735     0.68
theta[4]           1      1       1    2542    0.64     4500     1.13     2711     0.68
theta[5]           1      1       1    2701    0.68     5293     1.32     2954     0.74
theta[6]           1      1       1    2418    0.60     5564     1.39     2719     0.68
theta[7]           1      1       1    2638    0.66     4593     1.15     2775     0.69
theta[8]           1      1       1    2559    0.64     4756     1.19     2757     0.69
lp__               1      1       1    2610    0.65     2137     0.53     3074     0.77

5.6.2.3 Non-centered parameterization default MCMC options + adapt_delta=0.95 + longer chains

If in doubt, we can run longer chains.

input_data <- read_rdump("eight_schools.data.R")
fit_ncp3 <- stan(
  file='eight_schools_ncp.stan', data=input_data,
  iter=20000, chains=4, control=list(adapt_delta = 0.954), 
  seed=483892929, refresh=0
)
res <- monitor_extra(fit_ncp3)
round(res[, sel], 2)
Inference for the input samples ( chains: each with iter = ; warmup = ):

               sRhat zsRhat zfsRhat reff zsreff zfsreff medsreff madsreff
mu                 1      1       1 1.04   1.04    0.56     1.06     0.66
tau                1      1       1 0.79   0.62    0.75     0.79     0.73
theta_tilde[1]     1      1       1 1.08   1.08    0.54     1.04     0.64
theta_tilde[2]     1      1       1 1.17   1.17    0.52     1.14     0.61
theta_tilde[3]     1      1       1 1.16   1.16    0.50     1.13     0.57
theta_tilde[4]     1      1       1 1.15   1.15    0.55     1.13     0.64
theta_tilde[5]     1      1       1 1.06   1.06    0.53     1.10     0.61
theta_tilde[6]     1      1       1 1.22   1.22    0.53     1.19     0.60
theta_tilde[7]     1      1       1 1.06   1.07    0.56     1.01     0.62
theta_tilde[8]     1      1       1 1.19   1.19    0.51     1.14     0.61
theta[1]           1      1       1 0.99   1.04    0.69     1.04     0.72
theta[2]           1      1       1 1.26   1.27    0.63     1.15     0.68
theta[3]           1      1       1 1.04   1.08    0.65     1.10     0.69
theta[4]           1      1       1 1.20   1.23    0.67     1.18     0.71
theta[5]           1      1       1 1.16   1.17    0.70     1.13     0.74
theta[6]           1      1       1 1.15   1.18    0.64     1.15     0.69
theta[7]           1      1       1 1.12   1.13    0.72     1.08     0.74
theta[8]           1      1       1 1.04   1.09    0.65     1.10     0.70
lp__               1      1       1 0.42   0.42    0.69     0.51     0.77

For each parameter, Bulk_Reff and Tail_Reff are crude measures of relative
effective sample size for bulk and tail quantities respectively (good mixing
Reff > 0.1), and Rhat is the potential scale reduction factor on rank normalized
split chains (at convergence, Rhat = 1).

All Rhats are close to 1, and relative efficiencies are good. Relative efficiencies are similar as for shorter chain, which means that running longer chains is giving us higher effective sample size.

We examine the relative efficiency in different parts of the posterior by computing the relative efficiency of small interval probability estimates for tau.

plot_local_reff(fit = fit_ncp3, par = "tau", nalpha = 100)

Small tau values are still more difficult to explore, but the relative efficiency is in a good range.

We examine also the relative efficiency of different quantile estimates.

plot_quantile_reff(fit = fit_ncp3, par = "tau", nalpha = 100)

Rank plot visualisation of tau, which has the smallest relative efficiency of 0.62 among parameters.

samp_ncp3 <- as.array(fit_ncp3)
mcmc_hist_r_scale(samp_ncp3[, , "tau"])

Nothing special.

Rank plot visualisation of lp__, which has a relative efficiency of 0.42.

mcmc_hist_r_scale(samp_ncp3[, , "lp__"])

Nothing special.

Now we should be able to trust Monte Carlo error estimates, and for illustration we show MCSE, interval and corresponding S_eff for median of mu and tau. Longer chains have reduced MCSE as we would expect for well nehaving chains.

round(quantile_mcse(samp_ncp3[ , , "mu"], prob = 0.5), 2)
  mcse  q05  q95     Seff
1 0.02 4.38 4.45 42277.74
round(quantile_mcse(samp_ncp3[ , , "tau"], prob = 0.5), 2)
  mcse  q05  q95     Seff
1 0.02 2.69 2.76 31721.59

The proposed default output from Stan would look as follows, with all Rhat<1.01 and Reff>0.1.

monitor_extra(fit_ncp3)
Inference for the input samples (4 chains: each with iter = 20000; warmup = 10000):

                mean se_mean   sd     Q5   Q50   Q95  neff reff sneff zneff zsneff zsreff Rhat
mu              4.39    0.02 3.31  -1.09  4.42  9.81 41499 1.04 41588 41604  41693   1.04    1
tau             3.57    0.02 3.18   0.24  2.73  9.73 31663 0.79 31684 24652  24674   0.62    1
theta_tilde[1]  0.31    0.00 0.99  -1.34  0.32  1.93 43299 1.08 43353 43339  43394   1.08    1
theta_tilde[2]  0.10    0.00 0.93  -1.46  0.10  1.64 46845 1.17 46929 46855  46939   1.17    1
theta_tilde[3] -0.09    0.00 0.97  -1.68 -0.10  1.51 46487 1.16 46509 46517  46539   1.16    1
theta_tilde[4]  0.06    0.00 0.94  -1.49  0.07  1.59 46046 1.15 46077 46104  46135   1.15    1
theta_tilde[5] -0.16    0.00 0.92  -1.67 -0.16  1.36 42442 1.06 42482 42472  42512   1.06    1
theta_tilde[6] -0.07    0.00 0.94  -1.62 -0.08  1.49 48847 1.22 48863 48855  48871   1.22    1
theta_tilde[7]  0.35    0.00 0.96  -1.26  0.36  1.91 42458 1.06 42530 42604  42677   1.07    1
theta_tilde[8]  0.08    0.00 0.98  -1.52  0.09  1.67 47683 1.19 47741 47738  47796   1.19    1
theta[1]        6.18    0.03 5.55  -1.52  5.61 16.07 39714 0.99 39823 41376  41514   1.04    1
theta[2]        4.89    0.02 4.68  -2.51  4.82 12.61 50581 1.26 50663 50682  50761   1.27    1
theta[3]        3.90    0.03 5.26  -4.86  4.17 11.72 41573 1.04 41620 43083  43134   1.08    1
theta[4]        4.74    0.02 4.72  -2.71  4.68 12.31 48078 1.20 48115 49159  49196   1.23    1
theta[5]        3.63    0.02 4.63  -4.25  3.87 10.65 46269 1.16 46303 46640  46676   1.17    1
theta[6]        4.04    0.02 4.83  -3.99  4.20 11.44 46136 1.15 46151 47355  47372   1.18    1
theta[7]        6.28    0.02 5.09  -0.98  5.79 15.52 44647 1.12 44696 45105  45195   1.13    1
theta[8]        4.89    0.03 5.29  -3.28  4.78 13.30 41749 1.04 41797 43734  43769   1.09    1
lp__           -6.95    0.02 2.35 -11.27 -6.63 -3.72 16872 0.42 16885 16813  16825   0.42    1
               sRhat zRhat zsRhat zfsRhat zfsneff zfsreff medsneff medsreff madsneff madsreff
mu                 1     1      1       1   22542    0.56    42278     1.06    26494     0.66
tau                1     1      1       1   29890    0.75    31722     0.79    29172     0.73
theta_tilde[1]     1     1      1       1   21800    0.54    41488     1.04    25544     0.64
theta_tilde[2]     1     1      1       1   20881    0.52    45573     1.14    24265     0.61
theta_tilde[3]     1     1      1       1   19949    0.50    45366     1.13    22742     0.57
theta_tilde[4]     1     1      1       1   21892    0.55    45260     1.13    25414     0.64
theta_tilde[5]     1     1      1       1   21075    0.53    43822     1.10    24200     0.61
theta_tilde[6]     1     1      1       1   21025    0.53    47417     1.19    24145     0.60
theta_tilde[7]     1     1      1       1   22302    0.56    40482     1.01    24866     0.62
theta_tilde[8]     1     1      1       1   20515    0.51    45709     1.14    24325     0.61
theta[1]           1     1      1       1   27517    0.69    41503     1.04    28846     0.72
theta[2]           1     1      1       1   25205    0.63    46115     1.15    27361     0.68
theta[3]           1     1      1       1   25988    0.65    43875     1.10    27778     0.69
theta[4]           1     1      1       1   26656    0.67    47042     1.18    28542     0.71
theta[5]           1     1      1       1   28074    0.70    45153     1.13    29616     0.74
theta[6]           1     1      1       1   25715    0.64    46137     1.15    27577     0.69
theta[7]           1     1      1       1   28812    0.72    43340     1.08    29411     0.74
theta[8]           1     1      1       1   25897    0.65    44080     1.10    28126     0.70
lp__               1     1      1       1   27667    0.69    20502     0.51    30713     0.77

5.7 Appendix G: Dynamic HMC and effective sample size

We have already seen that the relative efficiency of dynamic can be higher than with independent draws. The next example illustrates interesting relative efficiency phenomena due to properties of the dynamic HMC algorithms.

We sample from a simple 16-dimensional unit normal model.

writeLines(readLines("normal.stan"))
data {
  int<lower=1> J;
}
parameters {
  vector[J] x;
}
model {
  x ~ normal(0, 1);
}
fit_n <- stan(
  file='normal.stan', data=data.frame(J=16),
  iter=20000, chains=4, seed=483892929, refresh=0
)
samp <- as.array(fit_n)
res <- monitor_extra(samp)
print(res)
Inference for the input samples (4 chains: each with iter = 10000; warmup = 0):

       mean se_mean   sd     Q5   Q50   Q95   neff reff  sneff  zneff zsneff zsreff Rhat sRhat
x[1]   0.00    0.00 1.00  -1.66  0.00  1.65  97846 2.45  98426  97687  98264   2.46    1     1
x[2]   0.00    0.00 1.00  -1.64 -0.01  1.64  95437 2.39  95717  95531  95812   2.40    1     1
x[3]   0.00    0.00 0.99  -1.63  0.00  1.62  98375 2.46  98703  98313  98640   2.47    1     1
x[4]   0.01    0.00 1.01  -1.65  0.00  1.66  96642 2.42  97290  96654  97302   2.43    1     1
x[5]   0.00    0.00 1.00  -1.64  0.00  1.63 101317 2.53 101514 101344 101542   2.54    1     1
x[6]   0.00    0.00 1.00  -1.65  0.00  1.65  96011 2.40  96299  96003  96292   2.41    1     1
x[7]   0.00    0.00 0.99  -1.63  0.01  1.63  95558 2.39  96082  95492  96016   2.40    1     1
x[8]   0.00    0.00 1.00  -1.65 -0.01  1.65  99937 2.50 100392  99920 100375   2.51    1     1
x[9]   0.00    0.00 1.00  -1.64  0.01  1.65 100824 2.52 101134 100831 101141   2.53    1     1
x[10]  0.00    0.00 0.99  -1.62 -0.01  1.63 102178 2.55 102983 102317 103126   2.58    1     1
x[11]  0.00    0.00 1.00  -1.65  0.01  1.66  95226 2.38  95863  95250  95886   2.40    1     1
x[12]  0.01    0.00 0.99  -1.62  0.00  1.63  97828 2.45  98357  97903  98433   2.46    1     1
x[13]  0.00    0.00 0.99  -1.62  0.01  1.65  97666 2.44  98166  97682  98181   2.45    1     1
x[14]  0.00    0.00 0.99  -1.63  0.00  1.63  96805 2.42  97327  96792  97313   2.43    1     1
x[15]  0.01    0.00 0.99  -1.63  0.01  1.64  94911 2.37  95185  94950  95223   2.38    1     1
x[16]  0.00    0.00 1.01  -1.66  0.00  1.65  99541 2.49 100111  99413  99980   2.50    1     1
lp__  -7.95    0.02 2.79 -13.00 -7.66 -3.92  14922 0.37  14934  14480  14489   0.36    1     1
      zRhat zsRhat zfsRhat zfsneff zfsreff medsneff medsreff madsneff madsreff
x[1]      1      1       1   16450    0.41    82432     2.06    19205     0.48
x[2]      1      1       1   16462    0.41    75494     1.89    19208     0.48
x[3]      1      1       1   16152    0.40    78630     1.97    18732     0.47
x[4]      1      1       1   16075    0.40    81148     2.03    19079     0.48
x[5]      1      1       1   16785    0.42    79953     2.00    20116     0.50
x[6]      1      1       1   16578    0.41    79018     1.98    19626     0.49
x[7]      1      1       1   17109    0.43    81690     2.04    19543     0.49
x[8]      1      1       1   16400    0.41    79263     1.98    18828     0.47
x[9]      1      1       1   15890    0.40    81119     2.03    18683     0.47
x[10]     1      1       1   16460    0.41    76948     1.92    19242     0.48
x[11]     1      1       1   15969    0.40    79164     1.98    18329     0.46
x[12]     1      1       1   15294    0.38    81841     2.05    18720     0.47
x[13]     1      1       1   15370    0.38    80615     2.02    18210     0.46
x[14]     1      1       1   16452    0.41    77592     1.94    19050     0.48
x[15]     1      1       1   16651    0.42    80406     2.01    19622     0.49
x[16]     1      1       1   16395    0.41    82347     2.06    18845     0.47
lp__      1      1       1   21484    0.54    17074     0.43    23603     0.59

Bulk-\(R_{\rm eff}\) for all \(x\) is larger than 2.38. However tail-\(R_{\rm eff}\) for all \(x\) is less than 0.43, and bulk-\(R_{\rm eff}\) for lp__ is only 0.36. If we take a look at all the Stan examples in this notebook, we see that the bulk-\(R_{\rm eff}\) for lp__ is always below 0.5. lp__ correlates strongly with the total energy in HMC, and that total energy is sampled using random walk proposal once per iteration and thus it’s likely that lp__ has some random walk behavior leading to autocorrelation and relative efficiency below 1. On the other hand, No-U-Turn behavior and sampling from the trajectory may make the Markov chain to be antithetic with negative odd lag correlation and the bulk relative efficiency higher than 1 for some parameters.

Let’s check the relative efficiency in different parts of the posterior by computing the relative efficiency of small interval probability estimates for x[1].

plot_local_reff(fit_n, par = 1, nalpha = 100)

The relative efficiency for probability estimate for a small interval is close to 1 with a slight drop in the tails. This is a good result, but far from the relative efficiency for the bulk, mean, and median estimates.

Let’s check the relative efficiency of quantiles.

plot_quantile_reff(fit = fit_n, par = 1, nalpha = 100)

Central quantile estimates have higher relative efficiency than tail quantile estimates.

The total energy of HMC should affect how far in the tails a chain in one iteration can go. Fat tails of the target have high energy, and thus only chains with high total energy can reach there. This will suggest that the random walk in total energy would cause random walk in variance of x. Let’s check the second moment of x.

samp <- as.array(fit_n, pars = "x")^2
res <- monitor_extra(samp)
print(res)
Inference for the input samples (4 chains: each with iter = 10000; warmup = 0):

      mean se_mean   sd Q5  Q50  Q95  neff reff sneff zneff zsneff zsreff Rhat sRhat zRhat zsRhat
x[1]  1.01    0.01 1.44  0 0.46 3.85 14657 0.37 14664 16440  16443   0.41    1     1     1      1
x[2]  0.99    0.01 1.42  0 0.44 3.80 15511 0.39 15518 16488  16492   0.41    1     1     1      1
x[3]  0.98    0.01 1.39  0 0.45 3.80 14684 0.37 14703 16133  16148   0.40    1     1     1      1
x[4]  1.01    0.01 1.46  0 0.45 3.95 14510 0.36 14518 16039  16070   0.40    1     1     1      1
x[5]  1.00    0.01 1.42  0 0.45 3.86 15017 0.38 15029 16766  16785   0.42    1     1     1      1
x[6]  1.00    0.01 1.42  0 0.45 3.91 14368 0.36 14380 16547  16572   0.41    1     1     1      1
x[7]  0.99    0.01 1.39  0 0.45 3.74 15464 0.39 15488 17080  17097   0.43    1     1     1      1
x[8]  1.00    0.01 1.42  0 0.46 3.80 14773 0.37 14766 16397  16397   0.41    1     1     1      1
x[9]  1.00    0.01 1.41  0 0.45 3.81 13439 0.34 13467 15910  15922   0.40    1     1     1      1
x[10] 0.98    0.01 1.39  0 0.44 3.73 14654 0.37 14672 16430  16461   0.41    1     1     1      1
x[11] 1.00    0.01 1.41  0 0.46 3.85 15068 0.38 15098 15997  16008   0.40    1     1     1      1
x[12] 0.99    0.01 1.41  0 0.45 3.75 14215 0.36 14221 15342  15368   0.38    1     1     1      1
x[13] 0.98    0.01 1.38  0 0.44 3.83 13548 0.34 13547 15368  15371   0.38    1     1     1      1
x[14] 0.98    0.01 1.37  0 0.45 3.75 14547 0.36 14565 16418  16461   0.41    1     1     1      1
x[15] 0.98    0.01 1.38  0 0.45 3.77 15417 0.39 15414 16652  16655   0.42    1     1     1      1
x[16] 1.01    0.01 1.41  0 0.47 3.86 15551 0.39 15561 16390  16400   0.41    1     1     1      1
      zfsRhat zfsneff zfsreff medsneff medsreff madsneff madsreff
x[1]        1   18445    0.46    19172     0.48    23268     0.58
x[2]        1   19699    0.49    19309     0.48    24908     0.62
x[3]        1   18532    0.46    18694     0.47    23865     0.60
x[4]        1   18983    0.47    19156     0.48    23907     0.60
x[5]        1   19535    0.49    20102     0.50    25043     0.63
x[6]        1   17535    0.44    19596     0.49    22613     0.57
x[7]        1   19019    0.48    19555     0.49    23336     0.58
x[8]        1   18920    0.47    18816     0.47    24195     0.60
x[9]        1   18386    0.46    18674     0.47    22198     0.55
x[10]       1   18570    0.46    19147     0.48    23900     0.60
x[11]       1   19482    0.49    18393     0.46    24709     0.62
x[12]       1   18691    0.47    18588     0.46    23963     0.60
x[13]       1   18506    0.46    18258     0.46    24132     0.60
x[14]       1   18823    0.47    19062     0.48    23428     0.59
x[15]       1   19633    0.49    19620     0.49    24476     0.61
x[16]       1   20083    0.50    18829     0.47    24441     0.61

Mean of bulk-\(R_{\rm eff}\) for \(x_j^2\) is 0.41, which is quite close to bulk-\(R_{\rm eff}\) for lp__. This is not that surprising as the potential energy in normal model is relative to \(\sum_{j=1}^J x_j^2\).

Let’s check the relative efficiency in different parts of the posterior by computing the relative efficiency of small interval probability estimates for x[1]^2.

plot_local_reff(fit = samp, par = 1, nalpha = 100)

The relative efficiency is mostly a bit below 1, but for the right tail of \(x_1^2\) the relative efficiency drops. This likely due to only some iterations have high enough total energy to obtain draws from the high energy part of the tail.

Let’s check the relative efficiency of quantiles.

plot_quantile_reff(fit = samp, par = 1, nalpha = 100)

We can see the correlation between lp__ and magnitude of x[1] in the following plot.

samp <- as.array(fit_n)
qplot(
  as.vector(samp[, , "lp__"]),
  abs(as.vector(samp[, , "x[1]"]))
) + 
  xlab('lp__') + 
  ylab('x[1]')

Low lp__ corresponds to high energy and more variation in x[1], and high lp__ corresponds to low energy and small variation in x[1]. Finally \(\sum_{j=1}^J x_j^2\) is perfectly correlated with lp__.

qplot(
  as.vector(samp[, , "lp__"]),
  as.vector(apply(samp[, , 1:16]^2, 1:2, sum))
) + 
  xlab('lp__') + 
  ylab('sum(x^2)')

This shows that even if we get high relative efficiency estimates for central quantities (like mean and median), it is important to look at relative efficiency of scale and tail quantities, too. The relative efficiency of lp__ can also indicate problems of sampling in tails.

References

Betancourt, M. (2017) ‘A conceptual introduction to hamiltonian monte carlo’, arXiv preprint arXiv:1701.02434.

Brooks, S. P. and Gelman, A. (1998) ‘General methods for monitoring convergence of iterative simulations’, Journal of Computational and Graphical Statistics, 7(4), pp. 434–455.

Gelman, A. et al. (2013) Bayesian data analysis, third edition. CRC Press.

Gelman, A. and Rubin, D. B. (1992) ‘Inference from iterative simulation using multiple sequences’, Statistical science, 7(4), pp. 457–472.

Geyer, C. J. (1992) ‘Practical Markov chain Monte Carlo’, Statistical Science, 7, pp. 473–483.

Geyer, C. J. (2011) ‘Introduction to Markov chain Monte Carlo’, in Brooks, S. et al. (eds) Handbook of markov chain monte carlo. CRC Press.

Hoffman, M. D. and Gelman, A. (2014) ‘The No-U-Turn Sampler: Adaptively setting path lengths in Hamiltonian Monte Carlo’, Journal of Machine Learning Research, 15, pp. 1593–1623. Available at: http://jmlr.org/papers/v15/hoffman14a.html.

Stan Development Team (2018a) Bayesian statistics using stan. Stan Development Team. Available at: https://github.com/stan-dev/stan-book.

Stan Development Team (2018b) ‘RStanArm: Bayesian applied regression modeling via Stan. R package version 2.17.4’. Available at: http://mc-stan.org.

Stan Development Team (2018c) ‘RStan: The R interface to Stan. R package version 2.17.3’. Available at: http://mc-stan.org.

Stan Development Team (2018d) ‘Stan modeling language users guide and reference manual. Version 2.18.0’. Available at: http://mc-stan.org.

Stan Development Team (2018e) ‘The Stan core library version 2.18.0’. Available at: http://mc-stan.org.

Original Computing Environment

makevars <- file.path(Sys.getenv("HOME"), ".R/Makevars")
if (file.exists(makevars)) {
  writeLines(readLines(makevars)) 
}
devtools::session_info("rstan")
Session info --------------------------------------------------------------------------------------
 setting  value                       
 version  R version 3.5.1 (2018-07-02)
 system   x86_64, mingw32             
 ui       RTerm                       
 language (EN)                        
 collate  German_Germany.1252         
 tz       Europe/Berlin               
 date     2018-11-09                  
Packages ------------------------------------------------------------------------------------------
 package      * version   date       source                          
 assertthat     0.2.0     2017-04-11 CRAN (R 3.5.0)                  
 backports      1.1.2     2017-12-13 CRAN (R 3.5.0)                  
 base64enc      0.1-3     2015-07-28 CRAN (R 3.5.0)                  
 BH             1.66.0-1  2018-02-13 CRAN (R 3.5.0)                  
 callr          3.0.0     2018-08-24 CRAN (R 3.5.1)                  
 cli            1.0.1     2018-09-25 CRAN (R 3.5.1)                  
 colorspace     1.3-2     2016-12-14 CRAN (R 3.5.0)                  
 compiler       3.5.1     2018-07-02 local                           
 crayon         1.3.4     2017-09-16 CRAN (R 3.5.0)                  
 desc           1.2.0     2018-05-01 CRAN (R 3.5.0)                  
 digest         0.6.18    2018-10-10 CRAN (R 3.5.1)                  
 fansi          0.4.0     2018-10-05 CRAN (R 3.5.1)                  
 ggplot2      * 3.0.0     2018-07-03 CRAN (R 3.5.1)                  
 glue           1.3.0     2018-07-17 CRAN (R 3.5.1)                  
 graphics     * 3.5.1     2018-07-02 local                           
 grDevices    * 3.5.1     2018-07-02 local                           
 grid           3.5.1     2018-07-02 local                           
 gridExtra    * 2.3       2017-09-09 CRAN (R 3.5.0)                  
 gtable         0.2.0     2016-02-26 CRAN (R 3.5.0)                  
 inline         0.3.15    2018-05-18 CRAN (R 3.5.1)                  
 labeling       0.3       2014-08-23 CRAN (R 3.5.0)                  
 lattice        0.20-35   2017-03-25 CRAN (R 3.5.1)                  
 lazyeval       0.2.1     2017-10-29 CRAN (R 3.5.0)                  
 loo            2.0.0     2018-04-11 CRAN (R 3.5.1)                  
 magrittr       1.5       2014-11-22 CRAN (R 3.5.0)                  
 MASS           7.3-50    2018-04-30 CRAN (R 3.5.1)                  
 Matrix         1.2-14    2018-04-13 CRAN (R 3.5.1)                  
 matrixStats    0.54.0    2018-07-23 CRAN (R 3.5.1)                  
 methods      * 3.5.1     2018-07-02 local                           
 mgcv           1.8-25    2018-09-21 local                           
 munsell        0.5.0     2018-06-12 CRAN (R 3.5.1)                  
 nlme           3.1-137   2018-04-07 CRAN (R 3.5.1)                  
 parallel       3.5.1     2018-07-02 local                           
 pillar         1.3.0     2018-07-14 CRAN (R 3.5.0)                  
 pkgbuild       1.0.2     2018-10-16 CRAN (R 3.5.1)                  
 plyr           1.8.4     2016-06-08 CRAN (R 3.5.0)                  
 prettyunits    1.0.2     2015-07-13 CRAN (R 3.5.1)                  
 processx       3.2.0     2018-08-16 CRAN (R 3.5.1)                  
 ps             1.2.0     2018-10-16 CRAN (R 3.5.1)                  
 R6             2.3.0     2018-10-04 CRAN (R 3.5.1)                  
 RColorBrewer   1.1-2     2014-12-07 CRAN (R 3.5.0)                  
 Rcpp           0.12.19   2018-10-01 CRAN (R 3.5.1)                  
 RcppEigen      0.3.3.4.0 2018-02-07 CRAN (R 3.5.0)                  
 reshape2       1.4.3     2017-12-11 CRAN (R 3.5.0)                  
 rlang          0.3.0     2018-10-22 CRAN (R 3.5.1)                  
 rprojroot      1.3-2     2018-01-03 CRAN (R 3.5.0)                  
 rstan        * 2.18.2    2018-11-02 local                           
 scales         1.0.0     2018-08-09 CRAN (R 3.5.1)                  
 StanHeaders  * 2.18.0    2018-10-07 CRAN (R 3.5.1)                  
 stats        * 3.5.1     2018-07-02 local                           
 stats4         3.5.1     2018-07-02 local                           
 stringi        1.2.4     2018-07-20 CRAN (R 3.5.1)                  
 stringr      * 1.3.1     2018-05-10 CRAN (R 3.5.1)                  
 tibble       * 1.4.2     2018-01-22 CRAN (R 3.5.0)                  
 tools          3.5.1     2018-07-02 local                           
 utf8           1.1.4     2018-05-24 CRAN (R 3.5.1)                  
 utils        * 3.5.1     2018-07-02 local                           
 viridisLite    0.3.0     2018-02-01 CRAN (R 3.5.0)                  
 withr          2.1.2     2018-04-26 Github (jimhester/withr@79d7b0d)